ISSN : 1229-3431(Print)
ISSN : 2287-3341(Online)
Journal of the Korean Society of Marine Environment and Safety Vol.21 No.3 pp.253-258
DOI : https://doi.org/10.7837/kosomes.2015.21.3.253

Estimating Suitable Probability Distribution Function for Multimodal Traffic Distribution Function

Sang-Lok Yoo*, Jae-Yong Jeong**, Jeong-Bin Yim**
*Graduate school of Mokpo National Maritime University, Mokpo 530-729, Korea
**Professor, Mokpo National Maritime University, Mokpo 530-729, Korea

* First Author : yoosangrok82@naver.com, 061-241-2750

Corresponding Author : jbyim@mmu.ac.kr 061-241-2750
May 14, 2015 June 12, 2015 June 26, 2015

Abstract

The purpose of this study is to find a suitable probability distribution function for complex distribution data such as multimodal data. The normal distribution is widely assumed as the probability distribution function. However, complex data such as multimodal data are very hard to estimate with the normal distribution function alone, and errors may arise when other single distribution functions, including the normal distribution, are used. In this study, we sought a well-fitting probability distribution function for a multimodal area by using AIS (Automatic Identification System) observation data gathered in Mokpo port during 2013. According to the chi-squared statistic, the Gaussian mixture model (GMM) fits better than the other distribution functions tested, namely the extreme value, generalized extreme value, logistic, and normal distributions. The GMM was thus found to be a suitable model for multimodal data of maritime traffic flow distribution. We expect that probability density functions for collision probability and traffic flow distribution can be calculated more precisely in the future.




    Honam Sea Grant R&D Program

    1.Introduction

    Maritime traffic flow is affected by traffic volume, tidal current, wave height, and other factors. Analyzing maritime traffic flow is very important for evaluating the hazard of each route and the collision probability. Therefore, estimating the probability density function (pdf) is crucial to enhancing the safety of maritime traffic.

    In previous research, Silveira et al. (2013) studied the collision probability and traffic pattern off the coast of Portugal, but they only drew histograms of navigation speed and location distribution and counted the traffic. Giuliana et al. (2013) detected anomalies by applying kernel density estimation to traffic density on the Italian coast. Fangliang et al. (2012) analyzed elements such as navigation speed and traffic distance in waterways of the Netherlands and Shanghai, and fitted them with normal and log-normal distribution functions. Qiang et al. (2014) analyzed traffic characteristics by fitting navigation speed in the Singapore Strait with beta and Weibull distributions. Liu et al. (2013) examined traffic flow with normal and exponential distribution functions derived from the distributions of traffic time and speed.

    Some studies estimated the collision probability for ships meeting or passing each other by assuming a normal distribution (Fujii et al., 1974). In addition, the proximity to a hazard, as defined by AASHTO (American Association of State Highway and Transportation Officials) and in the regulations of maritime traffic safety audit, was calculated from the navigation distance to estimate the collision probability with a normal distribution function (Yim, 2010; Yim and Kim, 2010; AASHTO, 2014). Studies normally assume that the probability density function of vessel traffic is a normal distribution function. However, complex data such as multimodal data are very hard to estimate with the normal distribution function alone, and errors may also arise when other single distribution functions are used instead of the normal distribution function.

    The GMM (Gaussian Mixture Model), a combination of multiple normal distributions, is very useful for analyzing complex distributions such as multimodal ones. The GMM has been used as an analysis tool in various fields such as biology, economics, business administration, physics, astronomy, and engineering. In particular, the GMM is widely used to estimate the probability density function from multivariate data (Ravindra et al., 2010; Gonzalez-Longatt et al., 2012). This study adopts the GMM to examine the frequency distribution of vessels and to estimate its parameters.

    2.

    2.1.Scope of Study Area

    This study used AIS observation data collected in Mokpo port over one year, from January 1 to December 31, 2013. As shown in Fig. 1, the study area is in Mokpogu, through which vessels pass.

    2.2.Procedure of Study

    The procedure of this study is shown in Fig. 2. Vessels were classified as inbound or outbound, the average position of the vessels (34.7656°N, 126.2926°E) was set as the center point, and the distance between each vessel's location and the center point was then calculated.
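    The distance step above can be sketched as follows. This is only an illustration: the source does not state which distance formula was used, so the haversine great-circle formula and the Earth radius constant are assumptions, and the function names are hypothetical. The sample data in the paper take both positive and negative values, but the sign convention is not described, so this sketch returns unsigned distances only.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)

# Center point reported in the text (average vessel position).
CENTER_LAT, CENTER_LON = 34.7656, 126.2926


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_from_center(lat, lon):
    """Unsigned distance in metres from the study's center point (hypothetical helper)."""
    return haversine_m(CENTER_LAT, CENTER_LON, lat, lon)
```

    For example, `distance_from_center(34.7700, 126.2926)` gives the distance of a position slightly north of the center point.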

    To test goodness of fit, we applied the chi-squared (χ²) test. The χ² test showed that the GMM fits these data, so we then applied GMMs with different numbers of components and selected a suitable model with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).

    3.Estimation of Probability Distribution Function

    3.1.Examining Probability Distribution Function for Test

    There are various types of probability distribution function. Since the sample data x in this study take both positive (+) and negative (-) values, distributions that require x > 0 or 0 ≤ x ≤ 1, such as the gamma distribution and the beta distribution, were excluded. The sample data were analyzed with the extreme value distribution (EV), generalized extreme value distribution (GEV), logistic distribution, normal distribution, and Gaussian mixture model (GMM).

    By using the fitdist and fitgmdist functions in MATLAB (2014a), we drew Fig. 4. Formulae (1) to (5) give the probability density function (pdf) of each of the five distributions, following MATLAB (MATLAB, 2014a; MATLAB, 2014b; MATLAB, 2014c).

    f_ev, the extreme value pdf for sample data x, is given by formula (1).

    $$ f_{ev}(x \mid \mu, \sigma) = \frac{1}{\sigma}\, e^{\frac{x - \mu}{\sigma}}\, e^{-e^{\frac{x - \mu}{\sigma}}} $$
    (1)

    where μ and σ are the location parameter and the scale parameter, respectively.

    f_gev, the generalized extreme value pdf for x, is given by formula (2).

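    $$ f_{gev}(x \mid \xi, \mu, \sigma) = \frac{1}{\sigma}\, t(x)^{\xi + 1}\, e^{-t(x)}, \qquad t(x) = \begin{cases} \left(1 + \xi\,\frac{x - \mu}{\sigma}\right)^{-1/\xi} & \text{if } \xi \neq 0 \\ e^{-\frac{x - \mu}{\sigma}} & \text{if } \xi = 0 \end{cases} $$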
    (2)

    where ξ is the shape parameter of the pdf.

    f_logistic, the logistic pdf for x, is given by formula (3).

    $$ f_{logistic}(x \mid \mu, \sigma) = \frac{e^{\frac{x - \mu}{\sigma}}}{\sigma \left(1 + e^{\frac{x - \mu}{\sigma}}\right)^{2}} $$
    (3)

    f_normal, the normal pdf for x, is given by formula (4).

    $$ f_{normal}(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}} $$
    (4)

    Finally, f_gmm, the GMM pdf for x, is given by formula (5).

    $$ f_{gmm}(x \mid c, \mu, \sigma) = \sum_{m=1}^{M} \frac{c_m}{\sigma_m \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu_m}{\sigma_m}\right)^{2}} $$
    (5)

    where μ_m and σ_m are the mean and standard deviation of the mth Gaussian distribution, and M is the number of Gaussian distributions in the mixture. c_m is the mth mixture coefficient: the proportion of the data belonging to the mth Gaussian distribution, i.e., the probability that a sample is drawn from it.
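    Formulae (1) to (5) can be written out as plain functions. The sketch below is not the MATLAB code used in the study; it is a minimal Python transcription of the five pdfs, and the function names and argument order are assumptions made for illustration.

```python
import math


def ev_pdf(x, mu, sigma):
    """Extreme value pdf, formula (1)."""
    z = (x - mu) / sigma
    return math.exp(z - math.exp(z)) / sigma


def gev_pdf(x, xi, mu, sigma):
    """Generalized extreme value pdf, formula (2)."""
    z = (x - mu) / sigma
    if xi == 0:
        t = math.exp(-z)
    else:
        base = 1 + xi * z
        if base <= 0:
            return 0.0  # x lies outside the support of the distribution
        t = base ** (-1 / xi)
    return t ** (xi + 1) * math.exp(-t) / sigma


def logistic_pdf(x, mu, sigma):
    """Logistic pdf, formula (3)."""
    e = math.exp((x - mu) / sigma)
    return e / (sigma * (1 + e) ** 2)


def normal_pdf(x, mu, sigma):
    """Normal pdf, formula (4)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def gmm_pdf(x, c, mu, sigma):
    """GMM pdf, formula (5): weighted sum of M normal pdfs.

    c, mu, sigma are equal-length sequences of mixture coefficients,
    means, and standard deviations.
    """
    return sum(cm * normal_pdf(x, mm, sm) for cm, mm, sm in zip(c, mu, sigma))
```

    A GMM with a single component and c = [1.0] reduces to the normal pdf, which is a quick sanity check on formula (5).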

    3.2.Goodness of Fit

    The χ² values were compared to evaluate the GMM, EV, GEV, logistic, and normal distribution functions. First, the range of the estimated distribution is divided into k intervals, [a_0, a_1), [a_1, a_2), ⋯, [a_{k-1}, a_k), and the count N_j (j = 1, 2, ⋯, k) is computed for each interval, where N_j is the number of samples X_i falling in the jth interval. Assuming that the samples follow the hypothesized distribution, p_j (the expected proportion of the X_i in the jth interval) is calculated, and the test statistic is obtained from formula (6), where n is the total number of samples (Wikipedia, 2015).

    $$ \chi^{2} = \sum_{j=1}^{k} \frac{(N_j - n p_j)^{2}}{n p_j} $$
    (6)
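    The binning and the statistic in formula (6) can be sketched as below. This is an illustrative Python version, not the procedure actually run in the study; the helper names are hypothetical, and the expected proportions p_j are taken as given (in practice they would come from the fitted distribution).

```python
from bisect import bisect_right


def bin_counts(samples, edges):
    """Count samples in the half-open intervals [edges[j], edges[j+1])."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        j = bisect_right(edges, x) - 1  # index of the interval containing x
        if 0 <= j < len(counts):        # ignore samples outside [a_0, a_k)
            counts[j] += 1
    return counts


def chi_squared_statistic(observed_counts, expected_probs):
    """Formula (6): sum over intervals of (N_j - n*p_j)^2 / (n*p_j)."""
    n = sum(observed_counts)
    stat = 0.0
    for n_j, p_j in zip(observed_counts, expected_probs):
        expected = n * p_j
        stat += (n_j - expected) ** 2 / expected
    return stat
```

    A lower value means the observed counts lie closer to the counts expected under the candidate distribution, which is how the models are compared in Table 1.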

    For the sample data of outbound vessels in July, χ² for each distribution function is shown in Table 1. The GMM performs best, since its χ² is lower than those of the other distribution functions. As shown in Fig. 3, the GMM also follows the sample data more closely than the other models, which confirms that the GMM is suitable.

    3.3.Selecting Suitable Gaussian Mixture Model and Estimating Parameters

    Gaussian mixture models with different numbers of components were fitted to select the optimal model. The fits are shown in Fig. 4, where GMM2 denotes a mixture of 2 Gaussian components, GMM3 of 3, GMM4 of 4, GMM5 of 5, and GMM6 of 6.
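    The study fitted the mixtures with MATLAB's fitgmdist. As a rough illustration of what such a fit involves, the sketch below runs a basic expectation-maximization (EM) loop for a one-dimensional GMM in pure Python. The initialization (means at evenly spaced quantiles), the fixed iteration count, and the synthetic two-cluster data are all assumptions for the example and do not reproduce the study's settings or data.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_gmm_1d(data, M, n_iter=100):
    """Fit an M-component 1-D GMM by EM; returns (c, mu, sigma, log_likelihood)."""
    n = len(data)
    ordered = sorted(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    # Assumed initialization: means at evenly spaced quantiles, equal weights.
    mu = [ordered[int((m + 0.5) * n / M)] for m in range(M)]
    sigma = [std / M] * M
    c = [1.0 / M] * M
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            w = [c[m] * normal_pdf(x, mu[m], sigma[m]) for m in range(M)]
            total = sum(w) or 1e-300
            resp.append([wm / total for wm in w])
        # M-step: re-estimate coefficients, means, and standard deviations.
        for m in range(M):
            nm = sum(r[m] for r in resp) or 1e-300
            c[m] = nm / n
            mu[m] = sum(r[m] * x for r, x in zip(resp, data)) / nm
            var = sum(r[m] * (x - mu[m]) ** 2 for r, x in zip(resp, data)) / nm
            sigma[m] = max(math.sqrt(var), 1e-6)  # floor avoids a collapsed component
    log_l = sum(
        math.log(max(sum(c[m] * normal_pdf(x, mu[m], sigma[m]) for m in range(M)), 1e-300))
        for x in data
    )
    return c, mu, sigma, log_l


# Synthetic bimodal data (illustrative only, not the AIS observations).
rng = random.Random(1)
demo_data = [rng.gauss(-200, 30) for _ in range(300)] + [rng.gauss(300, 40) for _ in range(300)]
demo_c, demo_mu, demo_sigma, demo_log_l = fit_gmm_1d(demo_data, 2)
```

    Calling `fit_gmm_1d` with M = 2, 3, ..., 6 on the same data yields one candidate per model, each with its own maximized log-likelihood for later comparison.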

    However, the more Gaussian components are mixed, the more parameters are created, which raises an overfitting problem. For this reason, formulae (7) and (8) were used to calculate AIC and BIC, which guard against overfitting (Akaike, 1974; Schwarz, 1978).

    $$ \mathrm{AIC} = -2 \ln L + 2k $$
    (7)
    $$ \mathrm{BIC} = -2 \ln L + k \ln n $$
    (8)

    n: sample size

    k: number of estimated parameters in the model

    L: maximized value of the likelihood function for the model

    A model could otherwise be built with many unnecessary parameters. Both criteria therefore impose a penalty on the number of parameters to discourage overly complex models, following the principle of parsimony. For AIC the penalty is 2k, and for BIC it is k·ln(n). The BIC penalty is heavier than the AIC penalty whenever ln(n) > 2, that is, for n ≥ 8, and the gap grows with n. For these reasons, we adopted BIC.
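    Formulae (7) and (8) translate directly into code. The sketch below is illustrative: the function names are hypothetical, and the parameter count of 3M - 1 for a one-dimensional GMM with M components (M means, M standard deviations, and M - 1 free mixture coefficients) is the usual convention, assumed here because the source does not state how k was counted.

```python
import math


def aic(log_l, k):
    """Formula (7): AIC = -2 ln L + 2k, with log_l = ln L."""
    return -2.0 * log_l + 2.0 * k


def bic(log_l, k, n):
    """Formula (8): BIC = -2 ln L + k ln n, with log_l = ln L."""
    return -2.0 * log_l + k * math.log(n)


def gmm_num_params(M):
    """Free parameters of a 1-D GMM with M components (assumed convention)."""
    return 3 * M - 1


def penalty_gap(k, n):
    """BIC penalty minus AIC penalty; positive once ln(n) > 2."""
    return k * math.log(n) - 2.0 * k
```

    With the same log-likelihood, the model with the lower criterion value is preferred; `penalty_gap(k, n)` shows how much more strongly BIC penalizes extra components as n grows.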

    Table 2 compares AIC and BIC for each GMM fitted to the outbound vessels in July. By the AIC criterion, GMM6 is preferred. However, because of the overfitting problem, we used BIC as the criterion and selected GMM3 as the suitable model.

    GMM3 consists of Gaussian distributions centered at 256 m, 27 m, and 348 m, with mixture coefficients of 0.07, 0.59, and 0.35, respectively (Table 3). The data are clustered around μ_2 (27 m) and μ_3 (348 m), which have the large mixture coefficients of 0.59 and 0.35. Therefore, we can regard the parameters μ_1, μ_2, ⋯, μ_M as the commonly used main routes.
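    As a small illustration of reading main routes off the fitted parameters, the sketch below ranks the GMM3 components by mixture coefficient, using the centers and coefficients quoted in the text. The helper name is hypothetical, and the standard deviations are omitted because they are not quoted in the text.

```python
# Component centers (m) and mixture coefficients reported in the text for GMM3.
centers_m = [256, 27, 348]
coefficients = [0.07, 0.59, 0.35]


def rank_components(centers, weights):
    """Return (center, weight) pairs sorted by descending mixture coefficient."""
    return sorted(zip(centers, weights), key=lambda cw: cw[1], reverse=True)


ranked = rank_components(centers_m, coefficients)
for center, weight in ranked:
    print(f"center {center} m, mixture coefficient {weight:.2f}")
```

    The two highest-ranked components correspond to the centers at 27 m and 348 m discussed above.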

    Tables 4 and 5 show, for each month, the GMM selected by the BIC criterion and its parameters, for inbound and outbound vessels respectively. From April to June, GMM4 was selected for both inbound and outbound vessels; from October to December, GMM4 was selected for inbound vessels and GMM3 for outbound vessels.

    For the sample data of inbound vessels in May, GMM4 is suitable, as shown in Fig. 5. The Gaussian components are centered (μ) at -820 m, -77 m, 206 m, and 422 m, with mixture coefficients (c) of 0.04, 0.08, 0.37, and 0.51, respectively.

    The goal of modeling is to obtain a probability distribution that expresses the distribution of the given sample well. In practice, however, it is often nearly impossible to describe a sample distribution with a single model. The alternative is to use a GMM, which can approximate a wide variety of data sets by combining multiple Gaussian distribution functions; the GMM is therefore considered a suitable model for maritime traffic distribution.

    4.Conclusions

    The normal distribution is widely assumed as the probability distribution function. However, complex data such as multimodal data are very hard to estimate with the normal distribution function alone, and errors may arise when other single distribution functions, including the normal distribution, are used.

    In this study, we sought a well-fitting probability distribution function for a multimodal area by using AIS observation data gathered in Mokpo port during 2013.

    As a result, the GMM fitted better than the other distribution functions tested (EV, GEV, logistic, and normal distributions) and was found to be a suitable model for multimodal data of maritime traffic flow distribution. The data were clustered around the means (μ) with large mixture coefficients (c), so the parameters μ_1, μ_2, ⋯, μ_M can be regarded as the commonly used main routes.

    In the future, the probability density functions for collision probability and traffic flow distribution can be calculated more precisely. We hope this advance will help enhance navigation safety and vessel traffic services.

    Figure

    Fig. 1. Scope of study area (Mokpo port, Korea).

    Fig. 2. Study procedure to select the suitable traffic distribution function.

    Fig. 3. Distribution fitting.

    Fig. 4. Various Gaussian mixture model fitting.

    Fig. 5. Gaussian mixture model fitting.

    Table

    Table 1. χ² of models

    Table 2. AIC & BIC of models

    Table 3. Model parameters of outbound (July, 2013)

    Table 4. Type & parameters of GMM (inbound)

    Table 5. Type & parameters of GMM (outbound)

    Reference

    1. AASHTO (2014) LRFD Bridge Design Specifications, Customary U.S. Units, pp.141-161
    2. Akaike H (1974) A New Look at the Statistical Model Identification, IEEE Transactions on Automatic Control, Vol.19(6), pp.716-723
    3. Fangliang X, Ligteringen H, Gulijk CV, Ale B (2012) AIS Data Analysis for Realistic Ship Traffic Simulation Model, Proceedings of IWNTM' 2012, pp.44-49
    4. Fujii Y, Yamanouchi H, Mizuki N (1974) A Study Factors Affecting the Frequency of Accidents in Marine Traffic, Journal of Navigation, pp.239-247
    5. Giuliana P, Vespe M, Bryan K (2013) Vessel Pattern Knowledge Discovery from AIS Data: A Framework for Anomaly Detection and Route Prediction, Entropy, Vol.15, pp.2218-2245
    6. Gonzalez-Longatt FM, Rueda JL, Erlich I, Bogdanov D, Villa W (2012) Identification of Gaussian Mixture Model using Mean Variance Mapping Optimization: Venezuelan Case, 2012 3rd IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), pp.1-6
    7. MATLAB (2014a) Programming. MATLAB Version 8.3 (R2014a)
    8. MATLAB (2014b) Statistical Toolbox. Fit Probability Distribution object to data. MATLAB Version 8.3 (R2014a)
    9. MATLAB (2014c) Statistical Toolbox. Fit Gaussian Mixture Distribution to data. MATLAB Version 8.3 (R2014a)
    10. Liu ZB, Fu YH, Cong YS (2013) The Simulation of Vessel Traffic Flow Based on Congruential Generator, International Conference on Remote Sensing, Environment and Transportation Engineering (RSETE), pp.179-182
    11. Qiang M, Weng J, Li S (2014) Analysis of AIS-based Vessel Traffic Characteristics in the Singapore Strait, 93rd Annual Meeting of Transportation Research Board, pp.1-19
    12. Ravindra S, Pal BC, Jabr RA (2010) Statistical Representation of Distribution System Loads Using Gaussian Mixture Model, IEEE Transactions on Power Systems, Vol.25(1), pp.29-37
    13. Schwarz GE (1978) Estimating the Dimension of a Model, Annals of Statistics, Vol.6(2), pp.461-464
    14. Silveira PAM, Teixeira AP, Guedes Soares C (2013) Use of AIS Data to Characterise Marine Traffic Patterns and Ship Collision Risk off the Coast of Portugal, Journal of Navigation, Vol.66, pp.879-898
    15. Yim JB (2010) Development of Collision Risk Evaluation Model Between Passing Vessel and Mokpo Harbour Bridge, Journal of Korean Navigation and Port Research, Vol.34(6), pp.405-415
    16. Yim JB, Kim DH (2010) Statistical Parameter Estimation to Calculate Collision Probability Between Mokpo Harbor Bridge and Passing Vessels, Journal of Korean Navigation and Port Research, Vol.34(8), pp.609-614
    17. Wikipedia (2015) Tutorial for Goodness of fit. http://www.en.wikipedia.org/wiki/Goodness_of_fit