ISSN : 1229-3431(Print)
ISSN : 2287-3341(Online)

Journal of the Korean Society of Marine Environment and Safety Vol.21 No.3 pp.253-258
DOI : https://doi.org/10.7837/kosomes.2015.21.3.253

Estimating Suitable Probability Distribution Function for Multimodal Traffic Distribution Function

Sang-Lok Yoo^*, Jae-Yong Jeong^**, Jeong-Bin Yim^**†

^*Graduate school of Mokpo National Maritime University, Mokpo 530-729, Korea
^**Professor, Mokpo National Maritime University, Mokpo 530-729, Korea

* First Author : yoosangrok82@naver.com, 061-241-2750

† Corresponding Author : jbyim@mmu.ac.kr, 061-241-2750

Received May 14, 2015 Review June 12, 2015 Accepted June 26, 2015

Abstract

The purpose of this study is to find a suitable probability distribution function for complex distribution data such as multimodal data. The normal distribution is widely assumed as the probability distribution function. However, complex data such as multimodal data are very hard to estimate with the normal distribution function alone, and errors may arise when a single distribution function, whether the normal distribution or another, is used. In this study, we searched for a well-fitting probability distribution function for a multimodal traffic area, using AIS (Automatic Identification System) observation data gathered in Mokpo port throughout 2013. According to the chi-squared statistic, the Gaussian mixture model (GMM) fits better than the other distribution functions tested, namely the extreme value, generalized extreme value, logistic, and normal distributions. The GMM was thus found to be a suitable model for multimodal maritime traffic flow data. This result should allow probability density functions for collision probability and traffic flow distribution to be calculated more precisely in the future.

Key Words : Probability distribution function, Multimodal, Gaussian mixture model, Normal distribution, Maritime traffic flow

Funding:

Honam Sea Grant R&D Program

1.Introduction

Maritime traffic flow is affected by traffic volume, tidal current, wave height, and other factors. Analyzing maritime traffic flow is very important for evaluating the hazard of each route and the collision probability. Estimating the probability density function (pdf) of traffic is therefore crucial to enhancing the safety of maritime traffic.

In previous research, Silveira et al. (2013) studied collision probability and traffic patterns off the coast of Portugal, but they only drew histograms of navigation speed and location distribution and counted the traffic. Giuliana et al. (2013) detected anomalies by applying kernel density estimation to traffic density off the Italian coast. Fangliang et al. (2012) analyzed elements such as navigation speed and traffic distance in waterways of the Netherlands and Shanghai, and fitted them with normal and log-normal distribution functions. Qiang et al. (2014) analyzed traffic characteristics by fitting navigation speed in the Singapore Strait with beta and Weibull distributions. Liu et al. (2013) examined traffic flow with normal and exponential distribution functions derived from the distributions of traffic time and speed.

Some studies estimated the collision probability of ships in meeting or passing situations by assuming a normal distribution (Fujii et al., 1974). The proximity to a hazard, as defined by AASHTO (American Association of State Highway and Transportation Officials) and in the regulations for maritime traffic safety audits, has also been calculated from the navigation distance in order to estimate the collision probability with a normal distribution function (Yim, 2010; Yim and Kim, 2010; AASHTO, 2014). Studies thus normally assume that the probability density function of vessel traffic is a normal distribution function. However, complex data such as multimodal data are very hard to estimate with the normal distribution function alone, and errors may remain even when other distribution functions are used in its place.

The GMM (Gaussian Mixture Model), which combines multiple normal distributions, is very useful for analyzing complex distributions such as multimodal ones. It has been used as an analysis tool in various fields such as biology, economics, business administration, physics, astronomy, and engineering. In particular, the GMM is widely used to estimate the probability density function of multivariate data (Ravindra et al., 2010; Gonzalez-Longatt et al., 2012). This study adopts the GMM to examine the frequency distribution of vessels and to estimate its parameters.

2.

2.1.Scope of Study Area

This study used AIS observation data collected in Mokpo port over one year, from January 1 to December 31, 2013. As shown in Fig. 1, the study area lies in Mokpogu, through which vessels pass.

2.2.Procedure of Study

The procedure of this study is shown in Fig. 2. Vessels were classified into inbound and outbound, the average position of the vessels (34.7656°N, 126.2926°E) was set as the center point, and the distance between each vessel's location and the center point was then calculated.
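The distance step can be sketched as follows. The text does not say how the distance was computed or how its sign was assigned, so this sketch assumes a great-circle (haversine) distance on a spherical Earth and returns unsigned metres only; the function names and the Earth radius constant are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # ASSUMPTION: mean Earth radius in metres

# Average vessel position reported in the text (latitude, longitude in degrees).
CENTER = (34.7656, 126.2926)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_from_center(lat, lon):
    """Unsigned distance in metres from the center point to an AIS position."""
    return haversine_m(CENTER[0], CENTER[1], lat, lon)
```

Because the sample data in the study take both positive and negative values, a sign convention (for example, which side of the center point a vessel lies on) would still have to be applied on top of this distance.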

To test goodness of fit, we applied the chi-squared (χ²) test. Because the χ² test indicated that the GMM fits these data, we then applied GMMs with different numbers of components and compared them with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The suitable GMM was chosen through this process.

3.Estimation of Probability Distribution Function

3.1.Examining Probability Distribution Function for Test

There are various types of probability distribution function. Because the sample data x in this study take both positive (+) and negative (-) values, distributions whose support is restricted to x>0 or 0≦x≦1, such as the gamma and beta distributions, were excluded. The sample data were analyzed using the extreme value distribution (EV), generalized extreme value distribution (GEV), logistic distribution, normal distribution, and Gaussian mixture model (GMM).

Using the fitdist and fitgmdist functions in MATLAB (2014a), we drew Fig. 4. Formulae (1) to (5) give the probability density function (pdf) of each of the five distribution functions, following the MATLAB documentation (MATLAB, 2014a; MATLAB, 2014b; MATLAB, 2014c).

f_ev (extreme value pdf for sample data x) can be described as formula (1).

f_{ev} (x |μ, σ) = \frac{1}{σ} \exp (\frac{x - μ}{σ}) \exp (- \exp (\frac{x - μ}{σ}))

(1)

Where μ and σ mean a location parameter and a scale parameter.

f_gev (generalized extreme value pdf for x) can be depicted as formula (2).

f_{gev} (x |ξ, μ, σ) = \frac{1}{σ} t {(x)}^{ξ + 1} e^{- t (x)}, \quad t (x) = \begin{cases} {(1 + ξ \frac{x - μ}{σ})}^{- \frac{1}{ξ}} & \text{if } ξ \neq 0 \\ e^{- \frac{x - μ}{σ}} & \text{if } ξ = 0 \end{cases}

(2)

Where ξ means the shape parameter of pdf.

Also, f_logistic (logistic pdf for x) can be depicted as formula (3).

f_{logistic} (x |μ, σ) = \frac{e^{\frac{x - μ}{σ}}}{σ {(1 + e^{\frac{x - μ}{σ}})}^{2}}

(3)

f_normal (normal pdf for x) can be depicted as formula (4).

f_{normal} (x |μ, σ) = \frac{1}{σ \sqrt{2 π}} e^{- \frac{{(x - μ)}^{2}}{2 σ^{2}}}

(4)
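As a rough cross-check of formulae (1) to (4), the four single-distribution pdfs can be written out directly. This is a plain-Python sketch, not the MATLAB fitdist routine used in the study; the function names and the overflow guards are illustrative assumptions, and the EV and GEV forms follow the exp-based expressions of the MATLAB documentation cited above.

```python
import math


def ev_pdf(x, mu, sigma):
    """Extreme value pdf, formula (1)."""
    z = (x - mu) / sigma
    if z > 700:  # exp(z) would overflow; the density is zero to machine precision
        return 0.0
    return math.exp(z - math.exp(z)) / sigma


def gev_pdf(x, xi, mu, sigma):
    """Generalized extreme value pdf, formula (2), evaluated via log t(x)."""
    z = (x - mu) / sigma
    if xi == 0:
        log_t = -z
    else:
        base = 1 + xi * z
        if base <= 0:  # x lies outside the support
            return 0.0
        log_t = -math.log(base) / xi
    if log_t > 700:  # t(x) is so large that exp(-t(x)) underflows to zero
        return 0.0
    return math.exp((xi + 1) * log_t - math.exp(log_t)) / sigma


def logistic_pdf(x, mu, sigma):
    """Logistic pdf, formula (3); uses the symmetry in z to avoid overflow."""
    e = math.exp(-abs((x - mu) / sigma))
    return e / (sigma * (1 + e) ** 2)


def normal_pdf(x, mu, sigma):
    """Normal pdf, formula (4)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def integrate(pdf, lo, hi, steps=20000):
    """Midpoint-rule integral, used only to check that a pdf integrates to about 1."""
    h = (hi - lo) / steps
    return sum(pdf(lo + (i + 0.5) * h) for i in range(steps)) * h
```

Each function takes the same location and scale arguments as its formula, so a call such as integrate(lambda x: normal_pdf(x, 0.0, 1.0), -40, 40) is a quick sanity check on the transcription.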

And f_gmm (GMM pdf for x) can be depicted as formula (5).

f_{gmm} (x |c, μ, σ) = \sum_{m = 1}^{M} \frac{c_{m}}{σ_{m} \sqrt{2 π}} \exp [- \frac{1}{2} {(\frac{x - μ_{m}}{σ_{m}})}^{2}]

(5)

Where M is the number of Gaussian distributions in the mixture, and μ_m and σ_m are the mean and standard deviation of the m^th Gaussian distribution. c_m is the m^th mixture coefficient, that is, the ratio of the data attributed to the m^th Gaussian distribution, or equivalently the probability that a sample is drawn from it.
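To make formula (5) concrete, the sketch below evaluates the mixture pdf with the July outbound parameters listed in Table 3. It rests on one loud assumption: the σ column of the tables is read as variances in m², since fitgmdist reports covariances and values such as 90807 would be implausible as standard deviations next to means of a few hundred metres. The rounded mixture coefficients sum to 1.01, so they are renormalised here; all names are illustrative.

```python
import math

# July 2013 outbound parameters as listed in Table 3.
MEANS = [-256.0, 27.0, 348.0]            # metres
VARIANCES = [90807.0, 31636.0, 8611.0]   # ASSUMPTION: table "σ" column holds variances (m^2)
WEIGHTS = [0.07, 0.59, 0.35]             # rounded coefficients; they sum to 1.01


def gmm_pdf(x, weights, means, sigmas):
    """Formula (5): weighted sum of Gaussian pdfs with standard deviations `sigmas`."""
    total = 0.0
    for c, mu, s in zip(weights, means, sigmas):
        z = (x - mu) / s
        total += c / (s * math.sqrt(2 * math.pi)) * math.exp(-0.5 * z * z)
    return total


# Convert the assumed variances to standard deviations and renormalise the weights.
sigmas = [math.sqrt(v) for v in VARIANCES]
weights = [c / sum(WEIGHTS) for c in WEIGHTS]

# Density at each component mean, in 1/m.
densities = {mu: gmm_pdf(mu, weights, MEANS, sigmas) for mu in MEANS}
```

Under this reading, the density near 27 m and 348 m is several times the density near -256 m, which matches the clustering described for the components with large mixture coefficients.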

3.2.Goodness of Fit

The χ² values were compared to evaluate the GMM, EV, GEV, logistic, and normal distribution functions. First, the range of the estimated distribution is divided into k intervals, i.e., [a₀, a₁), [a₁, a₂), ⋯, [a_k-1, a_k), and the count N_j (j=1, 2,⋯, k), the number of samples X_i falling in the j^th interval, is obtained for each interval. Assuming that the samples follow the hypothesized distribution, p_j (the expected proportion of X_i in the j^th interval) is calculated, and the test statistic is computed with formula (6), where n is the sample size (Wikipedia, 2015).

χ^{2} = \sum_{j = 1}^{k} \frac{{(N_{j} - {np}_{j})}^{2}}{{np}_{j}}

(6)

For the sample data of outbound vessels in July, the χ² of each distribution function is shown in Table 1. The GMM stands out because its χ² is lower than those of the other distribution functions. As shown in Fig. 3, the GMM also follows the sample data more closely than the other models, which confirms that the GMM is the appropriate model to carry forward.
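The statistic in formula (6) can be sketched in a few lines. The binning helper, the function names, and the example counts below are hypothetical and are not the study's data; the sketch only shows how N_j, p_j, and n combine.

```python
from bisect import bisect_right


def bin_counts(samples, edges):
    """Count samples in the half-open intervals [a_0, a_1), ..., [a_{k-1}, a_k)."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        j = bisect_right(edges, x) - 1  # index of the interval containing x
        if 0 <= j < len(counts):        # samples outside [a_0, a_k) are ignored
            counts[j] += 1
    return counts


def chi_squared(observed_counts, expected_probs):
    """Formula (6): sum over intervals of (N_j - n p_j)^2 / (n p_j)."""
    n = sum(observed_counts)
    stat = 0.0
    for n_j, p_j in zip(observed_counts, expected_probs):
        expected = n * p_j
        stat += (n_j - expected) ** 2 / expected
    return stat


# Hypothetical example: 100 samples in 4 intervals, each expected to hold 25%.
example = chi_squared([18, 22, 30, 30], [0.25, 0.25, 0.25, 0.25])  # 108 / 25 = 4.32
```

In practice each p_j would come from the fitted distribution as the difference of its cumulative distribution function at the interval edges, and a lower statistic indicates a closer fit, as in Table 1.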

3.3.Selecting suitable Gaussian Mixture Model and Estimating Parameter

Several Gaussian mixture models were applied to select the optimal model. They are shown in Fig. 4, where GMM2 denotes a mixture of 2 Gaussian distributions, GMM3 of 3, GMM4 of 4, GMM5 of 5, and GMM6 of 6.

However, the more Gaussian distributions are mixed, the more parameters are created, which raises the problem of overfitting. For this reason, formulae (7) and (8) were used to calculate the AIC and BIC, which guard against overfitting (Akaike, 1974; Schwarz, 1978).

AIC = - 2 ln L + 2 k

(7)

BIC = - 2 ln L + k \cdot ln (n)

(8)

n: sample size

k: number of estimated parameters in the model

L: maximized value of the likelihood function for the model

A model can always be built with more parameters than necessary. Both criteria therefore impose a penalty that discourages overly complex models, following the principle of parsimony. The penalty is 2k for AIC and k·ln(n) for BIC. The BIC penalty is heavier because ln(n) exceeds 2 for all but very small samples and keeps growing with n. For these reasons, we adopted BIC.
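The relation between the two penalties can be written out explicitly. The short derivation below uses the standard definitions of AIC and BIC (Akaike, 1974; Schwarz, 1978) for the same fitted model, so that the log-likelihood term cancels.

```latex
\mathrm{BIC} - \mathrm{AIC}
  = (-2 \ln L + k \ln n) - (-2 \ln L + 2k)
  = k (\ln n - 2),
\qquad
\mathrm{BIC} > \mathrm{AIC} \iff n > e^{2} \approx 7.39 .
```

The gap therefore grows with the number of parameters k. The values in Table 2 show this pattern: BIC exceeds AIC by 27, 44, 59, 76, and 92 for GMM2 through GMM6.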

The AIC and BIC of each GMM for the outbound vessels in July are compared in Table 2. By the AIC criterion, GMM6 would be preferred. However, because of the overfitting problem, we chose BIC as the criterion and selected GMM3 as the suitable model.
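The selection step amounts to picking the model with the smallest criterion value. The sketch below encodes the values of Table 2 and the standard AIC and BIC definitions; the helper names are illustrative, and the log-likelihoods themselves are not reported in the paper, so only the tabulated criteria are used.

```python
import math

# AIC and BIC per model, copied from Table 2 (outbound vessels, July 2013).
TABLE_2 = {
    "GMM2": {"AIC": 22855, "BIC": 22882},
    "GMM3": {"AIC": 22813, "BIC": 22857},
    "GMM4": {"AIC": 22803, "BIC": 22862},
    "GMM5": {"AIC": 22797, "BIC": 22873},
    "GMM6": {"AIC": 22793, "BIC": 22885},
}


def aic(log_likelihood, k):
    """AIC = -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """BIC = -2 ln L + k ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)


def best_model(table, criterion):
    """Return the model name with the smallest value of the given criterion."""
    return min(table, key=lambda name: table[name][criterion])


best_by_aic = best_model(TABLE_2, "AIC")  # "GMM6"
best_by_bic = best_model(TABLE_2, "BIC")  # "GMM3"
```

AIC keeps decreasing as components are added, whereas BIC reaches its minimum at three components, which is the behaviour that motivated choosing BIC.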

GMM3 consists of Gaussian distributions centered at -256 m, 27 m, and 348 m, with data ratios of 0.07, 0.59, and 0.35, respectively (Table 3). The data are clustered around μ₂ (27 m) and μ₃ (348 m), which have the large mixture coefficients of 0.59 and 0.35, respectively. Therefore, the parameters μ₁, μ₂, ⋯, μ_M can be regarded as the commonly used main routes.

Tables 4 and 5 show, for each month, the GMM selected by the BIC criterion and its parameters, for inbound and outbound vessels respectively. From April to June, GMM4 was selected for both inbound and outbound vessels; from October to December, GMM4 was selected for inbound vessels and GMM3 for outbound vessels.

For the sample data of inbound vessels in May, GMM4 was selected, as shown in Fig. 5. Its Gaussian distributions are centered (μ) at -820 m, -77 m, 206 m, and 422 m, with mixture coefficients (c) of 0.04, 0.08, 0.37, and 0.51, respectively.

The goal of modeling is to obtain a probability distribution that expresses the distribution of the given sample well. In reality, however, it is nearly impossible to describe a sample distribution with a single model. The alternative is the GMM, which can approximate varied data sets by combining multiple Gaussian distribution functions, so the GMM is considered a suitable model for maritime traffic distribution.

4.Conclusions

The normal distribution is widely assumed as the probability distribution function. However, complex data such as multimodal data are very hard to estimate with the normal distribution function alone, and errors may arise when a single distribution function, whether the normal distribution or another, is used.

In this study, we searched for a well-fitting probability distribution function for a multimodal traffic area, using AIS observation data gathered in Mokpo port throughout 2013.

As a result, the GMM fitted better than the other distribution functions tested, namely the EV, GEV, logistic, and normal distributions, and was found to be a suitable model for multimodal maritime traffic flow data. The data were clustered around the means (μ) that have large mixture coefficients (c), so the parameters μ₁, μ₂, ⋯, μ_M can be regarded as the commonly used main routes.

These results should allow probability density functions for collision probability and traffic flow distribution to be calculated more precisely in the future. We hope this advance will help enhance navigation safety and vessel traffic services.

Figure

Fig. 1. Scope of study area (Mokpo port, Korea).

Fig. 2. Study procedure to select the suitable traffic distribution function.

Fig. 3. Distribution fitting.

Fig. 4. Various Gaussian mixture model fitting.

Fig. 5. Gaussian mixture model fitting.

Table

Table 1. χ² of models

	EV	GEV	Logistic	Normal	GMM
χ²	65.53	27.39	117.47	87.59	24.97

Table 2. AIC & BIC of models

	GMM2	GMM3	GMM4	GMM5	GMM6
AIC	22855	22813	22803	22797	22793
BIC	22882	22857	22862	22873	22885

Table 3. Model parameters of outbound (July, 2013)

Month	μ	σ	c
Jul	-256, 27, 348	90807, 31636, 8611	0.07, 0.59, 0.35

Table 4. Type & parameters of GMM (inbound)

Month	Model	μ	σ	c
January	GMM3	-506, 138, 371	49836, 18334, 5391	0.03, 0.42, 0.55
February	GMM3	-491, 131, 353	49166, 19606, 5689	0.03, 0.39, 0.58
March	GMM3	-453, 139, 382	48398, 12698, 5332	0.03, 0.46, 0.51
April	GMM4	-801, -65, 167, 403	3251, 72003, 14619, 4650	0.03, 0.08, 0.40, 0.50
May	GMM4	-820, -77, 206, 422	1158, 78926, 18766, 5538	0.04, 0.08, 0.37, 0.51
June	GMM4	-832, -93, 203, 417	1301, 75950, 19262, 5168	0.04, 0.11, 0.40, 0.45
July	GMM3	-663, 128, 394	17829, 28514, 5980	0.02, 0.44, 0.54
August	GMM3	-544, 132, 387	44564, 24727, 5084	0.02, 0.43, 0.55
September	GMM3	-434, 114, 364	93227, 22399, 5479	0.02, 0.39, 0.59
October	GMM4	-406, 109, 304, 410	84949, 21094, 2652, 2077	0.03, 0.41, 0.31, 0.25
November	GMM4	-786, 126, 318, 436	19294, 24470, 3232, 2840	0.05, 0.39, 0.22, 0.34
December	GMM4	-883, -223, 134, 373	957, 90563, 18813, 5490	0.03, 0.05, 0.37, 0.55

Table 5. Type & parameters of GMM (outbound)

Month	Model	μ	σ	c
January	GMM3	-263, 18, 330	81755, 19655, 8463	0.05, 0.55, 0.40
February	GMM3	-582, 42, 333	53887, 26608, 8000	0.02, 0.65, 0.33
March	GMM3	-305, 42, 346	77354, 20997, 9480	0.04, 0.61, 0.35
April	GMM4	-822, 2, 100, 386	2151, 43166, 24065, 7250	0.03, 0.29, 0.42, 0.26
May	GMM4	-846, -74, 102, 396	529, 70612, 26628, 7885	0.05, 0.15, 0.54, 0.26
June	GMM4	-858, -76, 83, 378	227, 70779, 24670, 8041	0.05, 0.14, 0.47, 0.34
July	GMM3	-256, 27, 348	90807, 31636, 8611	0.07, 0.59, 0.35
August	GMM3	-195, 15, 321	72002, 25428, 10571	0.14, 0.47, 0.40
September	GMM3	-319, -21, 298	106900, 29953, 10735	0.04, 0.46, 0.50
October	GMM3	-298, 8, 290	86916, 28847, 8442	0.05, 0.48, 0.47
November	GMM3	-903, 41, 359	169, 39418, 7984	0.02, 0.62, 0.36
December	GMM3	-915, 56, 372	217, 38698, 6903	217, 38698, 6903

Reference

AASHTO (2014) LRFD Bridge Design Specifications. Customary U.S. Units, pp.141-161
Akaike H (1974) A New Look at the Statistical Model Identification , IEEE Transactions on Automatic Control, Vol.19 (6) ; pp.716-723
Fangliang X , Ligteringen H , Gulijk CV , Ale B (2012) AIS Data Analysis for Realistic Ship Traffic Simulation Model , Proceedings of IWNTM' 2012, pp.44-49
Fujii Y , Yamanouchi H , Mizuki N (1974) A Study Factors Affecting the Frequency of Accidents in Marine Traffic , Journal of Navigation, pp.239-247
Giuliana P , Vespe M , Bryan K (2013) Vessel Pattern Knowledge Discovery from AIS Data : A Framework for Anomaly Detection and Route Prediction , Entropy, Vol.15; pp.2218-2245
Gonzalez-Longatt FM , Rueda JL , Erlich I , Bogdanov D , Villa W (2012) Identification of Gaussian Mixture Model using Mean Variance Mapping Optimization: Venezuelan Case , 2012 3rd IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), pp.1-6
MATLAB (2014a) Programming. MATLAB Version 8.3 (R2014a)
MATLAB (2014b) Statistical Toolbox. Fit Probability Distribution object to data. MATLAB Version 8.3 (R2014a)
MATLAB (2014c) Statistical Toolbox. Fit Gaussian Mixture Distribution to data. MATLAB Version 8.3 (R2014a)
Liu ZB , Fu YH , Cong YS (2013) The Simulation of Vessel Traffic Flow Based on Congruential Generator , International Conference on Remote Sensing. Environment and Transportation Engineering(RSETE), pp.179-182
Qiang M , Weng J , Li S (2014) Analysis of AIS-based Vessel Traffic Characteristics in the Singapore Strait , 93rd Annual Meeting of Transportation Research Board, pp.1-19
Ravindra S , Pal BC , Jabr RA (2010) Statistical Representation of Distribution System Loads Using Gaussian Mixture Model , IEEE Transactions on Power Systems, Vol.25 (1) ; pp.29-37
Schwarz GE (1978) Estimating the Dimension of a Model , Annals of Statistics, Vol.6 (2) ; pp.461-464
Silveira PAM , Teixeira AP , Guedes Soares C (2013) Use of AIS Data to Characterise Marine Traffic Patterns and Ship Collision Risk off the Coast of Portugal , Journal of Navigation, Vol.66; pp.879-898
Yim JB (2010) Development of Collision Risk Evaluation Model Between Passing Vessel and Mokpo Harbour Bridge , Journal of Korean Navigation and Port Research, Vol.34 (6) ; pp.405-415
Yim JB , Kim DH (2010) Statistical Parameter Estimation to Calculate Collision Probability Between Mokpo Harbor Bridge and Passing Vessels , Journal of Korean Navigation and Port Research, Vol.34 (8) ; pp.609-614
Wikipedia (2015) Tutorial for Goodness of fit. http://www.en.wikipedia.org/wiki/Goodness_of_fit