1. Introduction
Effective communication at sea is a key element of ship operations. Accurate exchanges are essential for the safe and efficient operation of vessels, and unclear communication, whether onboard or with external parties, can lead to significant human, material, and environmental losses (Horck, 2005).
Against this backdrop, the International Maritime Organization (IMO) has made efforts to enhance the effectiveness and accuracy of onboard communication by establishing various regulations, guidelines, and educational programs (Noble, 2007). One representative example is the Standard Marine Communication Phrases (SMCP). Given that multinational and multicultural crew members operate and live together on a single vessel, there is a high possibility of communication errors. To address this issue, SMCP was developed to standardize phrases used in specific situations, ensuring the delivery of unified concepts based on a common language (IMO, 2002). Through the SMCP and related guidelines, the IMO provides standardized phrases for communication between ships or between ships and shore stations to ensure accurate and efficient exchanges.
To this end, maritime education institutions incorporate SMCP-based maritime English education following STCW requirements and IMO model courses. Unlike general language education, maritime English education requires specialized teaching methods, necessitating an approach distinct from conventional English instruction (IMO, 2015).
Recently, Large Language Models (LLMs) have been increasingly applied across various maritime industry sectors, including maritime education (Dijkstra et al., 2023). These technologies have been integrated into navigational decision support systems and used in training programs at maritime academies, such as ship simulation, navigation, and maritime safety management (Pei et al., 2024).
LLMs can serve as valuable tools for natural language processing and generation. However, as mentioned earlier, maritime English possesses distinct grammatical and lexical characteristics compared to general natural language, and the use of situation-specific standardized phrases is crucial. Therefore, this study aims to analyze the capability of general-purpose LLMs in handling maritime English and to assess how correctly these models use maritime English in compliance with international guidelines.
To achieve this, the study compares and analyzes the SMCP utilization ability of online models (ChatGPT-4o, Google Gemini 1.5 Pro) and an offline model (Meta LLaMA 3) using a set of 60 standardized questions. By doing so, it examines the differences in model performance based on question types and difficulty levels.
Subsequently, this study compares and analyzes the maritime English proficiency of maritime students with the maritime English processing performance of LLM models to assess the AI models’ understanding and sentence construction capabilities in maritime English.
Furthermore, it examines areas requiring improvement and explores the potential applications and limitations of LLMs in maritime English education and assessment.
2. LLM Model Evaluation Method for Maritime English Proficiency
2.1 International standard on Maritime English
IMO adopted Standard Marine Communication Phrases (SMCP) through Resolution A.918(22) to present an international standard for standardized phrases aimed at ensuring clear and unambiguous communication at sea. SMCP is divided into External Communication Phrases (PART A) and On-board Communication Phrases (PART B). PART A is an essential component of the curriculum, while PART B includes phrases that maritime officers are implicitly expected to be familiar with. Although SMCP cannot encompass all aspects of maritime English, it defines essential parts of maritime communication and standardizes phrases to facilitate their application in education and practice.
SMCP has distinct grammatical, lexical, and idiomatic expressions different from general English. This distinction is intended to simplify language as much as possible to reduce communication errors and maximize clarity. This specificity makes SMCP particularly effective in situations where time is limited, such as emergencies or navigational warnings, where psychological pressure is high (Rahman and Liton, 2017).
In conclusion, SMCP does not encompass the entirety of maritime English but serves as an essential communication guideline that includes only the necessary components required for becoming a maritime officer. Furthermore, as it differs from general English in terms of structure, grammar, and vocabulary, it can be regarded as a specialized form of English distinct from standard English. According to the international standard STCW, maritime officers must acquire and utilize this specialized SMCP. This study aims to assess maritime English proficiency based on standardized essential SMCP phrases, aligning with the unique characteristics of maritime English (Lunde, 2017).
2.2 Maritime English Evaluation and Scoring Method
The Maritime English platform is structured based on IMO SMCP and consists of 60 questions across four domains: on-board communication, navigational communication, cargo handling communication, and emergency communication. Each question directly utilizes the original SMCP text, reflecting phrases used in real-life maritime situations.
The difficulty of the questions is categorized into three levels: easy, intermediate, and difficult, distributed at 20%, 60%, and 20%, respectively. The difficulty distribution was designed with input from individuals with three to five years of experience in Maritime English education.
The sentence-based questions are extracted from SMCP Part A and Part B and evaluate sentence construction ability. These questions assess whether participants can construct sentences using appropriate SMCP-based vocabulary. Figure 1 shows how sentence questions are formed.
For the glossary-based questions, participants are required to provide explanations for terms listed in SMCP Part A, Glossary. Figure 2 illustrates how glossary questions are structured.
For the fill-in-the-blank questions, specific words are omitted from key phrases in SMCP Part A, requiring participants to complete the sentences with the appropriate terms. These questions, like the sentence-based ones, evaluate whether the correct maritime-specific vocabulary is used. Figure 3 illustrates how the fill-in-the-blank section is formed.
A total of 60 questions were designed, consisting of 24 sentence-based questions, 18 glossary-based questions, and 18 fill-in-the-blank questions (4:3:3 ratio). This design reflects the actual testing environment for students, where sentence-based questions were given a slightly higher proportion. The reason for assigning greater weight to sentence-based questions is that, on average, these questions typically require around 29 characters per response, whereas glossary and fill-in-the-blank questions require only about 13 and 7 characters per answer respectively. Since the testing system is intended to both maximize learning outcomes and accurately assess student competencies, the proportion of sentence-based questions was weighted more heavily.
Responses generated by LLM models were assessed using PHP's Similar_Text algorithm, which calculates similarity scores based on character matching. This algorithm, described in Implementing the World's Best Algorithms by Oliver, determines the degree of similarity between characters and strings. It selects the most similar reference sentence, recursively calculates matching prefixes and suffixes, and sums the lengths of common substrings across all recursive iterations to compute overall similarity.
This algorithm is widely used in plagiarism detection, spell checking, document similarity analysis, and translation evaluation.
The similarity is calculated as shown in equation (1) (Seor et al., 2023):

Similarity (%) = (2 × M) / (len(A) + len(B)) × 100   (1)

where M is the total number of matching characters found across all recursive iterations, and len(A) and len(B) are the lengths of the two compared strings.
For example, when comparing the words ‘course’ and ‘coarse’, the algorithm sequentially identifies the matching characters (‘c’, ‘o’, ‘r’, ‘s’, and ‘e’) while recognizing the non-matching characters (‘u’ and ‘a’). With 5 matched characters counted in both strings (10 characters) out of a combined length of 12 characters, it returns a similarity score of approximately 83.33%.
Particularly, this algorithm allows grading based on overall sentence similarity, meaning that minor spelling errors or omitted words do not automatically result in a completely incorrect answer. Instead, only the non-matching portions are deducted, ensuring that partial credit can be awarded rather than employing a strict all-or-none scoring system.
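For illustration, the character-matching logic described above can be sketched in Python as follows. This is a minimal sketch, not the production implementation: the study used PHP's built-in similar_text, and the function names below are illustrative. The last helper reflects the selection of the most similar reference sentence mentioned earlier.

```python
def common_chars(a: str, b: str) -> int:
    """Sum the lengths of common substrings, mirroring the recursive logic of PHP's similar_text."""
    # Find the longest common substring of a and b.
    best_len, best_i, best_j = 0, 0, 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        return 0
    # Recurse on the unmatched prefixes and suffixes and add their matches.
    return (best_len
            + common_chars(a[:best_i], b[:best_j])
            + common_chars(a[best_i + best_len:], b[best_j + best_len:]))


def similarity_percent(answer: str, reference: str) -> float:
    """Equation (1): 2 x matched characters / combined length x 100."""
    return 2 * common_chars(answer, reference) / (len(answer) + len(reference)) * 100


def best_reference_score(answer: str, references: list[str]) -> float:
    """Score the answer against the most similar of several reference sentences."""
    return max(similarity_percent(answer, ref) for ref in references)


print(round(similarity_percent("course", "coarse"), 2))  # 83.33
```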
This method differs from semantic-based calculation, which focuses on the meaning of the text rather than its structure. For example, BERT, which compares semantic similarity, gives ‘course’ and ‘coarse’ a similarity score of 31.58%: the structures of the two words are similar, but their meanings differ significantly. This concept is explained in a later section.
2.3 Analysis of Applied Models
This study compared and analyzed the following three LLM models:
ChatGPT is a large-scale language model (LLM) developed by OpenAI, first released on November 30, 2022. After continuous improvements, ChatGPT-4o, based on GPT-4, was officially launched in May 2024. This model has been trained on a vast amount of text data, allowing it to generate and understand text at a level similar to that of humans (OpenAI, 2024). It excels in natural language understanding and generation, enabling human-like conversations, programming, writing, and translation. It also possesses high situational adaptability, understanding user intent contextually and generating appropriate responses.
Google Gemini 1.5 is a multimodal LLM developed by Google, released in May 2024. It can process various input types, including text, images, audio, and video. Benchmark results indicate state-of-the-art performance, particularly in natural language understanding, reasoning, and text generation. Gemini 1.5 has been trained on a vast dataset, including professional literature, research papers, and programming code, demonstrating its capability in handling specialized terminologies in fields such as law, medicine, and engineering. Furthermore, it can continuously learn from newly added information, ensuring that it remains updated with the latest knowledge (Google, 2024).
Meta LLaMA 3 70B Instruct is an open-source LLM developed by Meta AI. Unlike ChatGPT and Google Gemini, which operate in the cloud environment, LLaMA 3 70B Instruct can run in a local environment, offering advantages in data security and privacy protection. This feature makes it suitable for restricted-network environments (Meta AI, 2024).
LLaMA 3 70B Instruct consists of 70 billion parameters, ensuring high performance and allowing for long-text processing with a context window of 4096 tokens. Additionally, it has been trained on over one million human-annotated examples, reducing errors and enhancing performance in complex conversations, contextual understanding, and reasoning.
This study evaluated the ability of these three LLM models to utilize standardized SMCP phrases. All three models were given the default prompt ‘Assume that you are a navigational officer onboard a vessel. Answer the questions according to the Standard Maritime Communication Phrases’. The responses generated by each model were scored using the PHP Similar_Text algorithm and stored in MySQL Workbench 8.0 CE for further analysis.
2.4 Student Test Data
To gain a more precise understanding of LLM model performance, this study compared the test results of maritime high school students with those of the LLM models. The dataset was obtained from a total of 56 second-year students in the Navigation Department of a maritime high school during the first semester of 2023.
The students had basic knowledge of both general and maritime English, following the maritime officer training curriculum. These students, having completed their first-year theoretical coursework, were applying their knowledge through onboard training at the time of the test.
The exam, consisting of the same questions given to the LLM models, contained 20 questions and was administered three times, each within a 20-minute time limit.
3. Results Analysis
3.1 Analysis of Test Results by Model
The analysis of sentence-based test scores for LLMs (Large Language Models) showed that ChatGPT recorded 79.96 points, demonstrating the best performance, followed by Gemini with 75.79 points. In contrast, Llama scored 58.80 points, showing a significant gap compared to the top two models. The average score of the students was 56.75 points, which was similar to Llama but relatively lower than the other online LLM models.
In the Maritime English glossary test, ChatGPT again achieved the highest score of 83.05 points, followed closely by Gemini with 82.60 points. Meanwhile, Llama recorded 70.34 points, showing a considerable gap compared to the top two models. Students recorded the lowest score of 56.92 points.
For the fill-in-the-blank test, ChatGPT recorded the highest score of 99.57 points, while Gemini followed with a relatively high score of 90.26 points. Llama scored 60.04 points, again showing a notable difference from the top two models, while students recorded the lowest score of 52.00 points.
The converted scores for each model were calculated based on a total of 2000 points. ChatGPT achieved 1735.4 points, which corresponds to 86.77 points on a 100-point scale. Gemini followed with 82.17 points, while Llama scored 62.64 points. Students recorded the lowest score, averaging 55.38 points. While ChatGPT and Gemini, which run on online servers, achieved relatively high scores above 80 points, Llama, which operates offline, recorded lower scores in the 60-point range.
One of the notable patterns was that LLM models scored higher on vocabulary-based questions than on sentence construction questions. As the scores indicate, all three models performed better on vocabulary-based questions than on sentence construction questions, a trend which was not observed among students. Table 1 provides a detailed comparison of Maritime English test scores by question type for LLM models and students.
After analyzing the responses from LLM models, it was found that they performed well on questions with clear and straightforward answers, particularly in short word fill-in and glossary questions, where they demonstrated a high accuracy rate. It is, to some extent, natural to observe such a result considering the operating principles of LLMs, which are trained to predict the most appropriate next word within a sentence. Yet, from the students' perspective, because the correct answers were mainly short words, even minor spelling mistakes or small errors could lead to significant score deductions. This appears to have contributed to the students' relatively low scores on fill-in-the-blank questions.
Meanwhile, when constructing predefined, relatively long and standardized SMCP-based sentences, LLM models, despite excelling at generating phrases suited for everyday language, tended to score lower compared to other question types. Table 2 presents the Maritime English test scores of LLM models and students by question difficulties.
In the easy-level questions, ChatGPT achieved the highest score of 88.85, followed by Gemini with 86.24. Meanwhile, Llama scored 63.07, and students recorded 63.29, showing similar performance between the offline model and students.
For middle-difficulty questions, the overall score gap tended to narrow. ChatGPT scored 85.06, and Gemini scored 80.98, maintaining strong performance. However, Llama scored 63.10, while students scored 57.05, showing a wider gap compared to sentence-based questions.
In the high-difficulty questions, the overall score differences became more pronounced. ChatGPT and Gemini scored 89.82 and 81.69, respectively, demonstrating stable performance even at higher difficulty levels. In contrast, Llama scored 60.83, and students recorded 42.45, showing a relatively large performance gap.
While students' scores showed a clear pattern—higher scores on easier questions and lower scores on more difficult ones—LLM models did not follow this trend, suggesting a different response pattern in handling varying difficulty levels.
Overall, the online LLM models outperformed both the offline model and students, consistently achieving higher scores. In particular, ChatGPT and Gemini demonstrated stable performance across all difficulty levels, maintaining high accuracy from easy to hard questions. Figure 4 is the score comparison chart among LLM models.
To determine whether there were statistically significant differences in scores between groups, a statistical analysis of the test results was conducted. First, a Shapiro-Wilk normality test was performed, which indicated that the data did not follow a normal distribution. As a result, a non-parametric Friedman test was conducted, revealing a significant difference among the three models (p < 0.0001).
A Nemenyi post-hoc test was then performed, showing that the mean difference between ChatGPT and Gemini was 4.57, with a p-value of 0.778, indicating no statistically significant difference. However, ChatGPT vs. Llama (mean difference 24.13, p-value 0.001) and Gemini vs. Llama (mean difference 19.56, p-value 0.001) showed statistically significant differences.
These results indicate that the online LLM models, ChatGPT (mean 86.77, median 98.22) and Gemini (mean 82.2, median 95.0), demonstrated statistically higher performance in Maritime English tasks compared to the offline model, Llama (mean 62.64, median 61.81). Table 3 shows pairwise comparisons of the three LLM models' performances after the Friedman test.
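For reference, the sequence of tests described above could be reproduced along the following lines, assuming scores holds the per-question scores of the three models (60 questions by 3 models). The library choices (scipy, scikit-posthocs) and the placeholder data are assumptions for illustration, not details reported in this study.

```python
# Minimal sketch of the statistical workflow: Shapiro-Wilk normality check,
# Friedman test across the three related samples, then Nemenyi post-hoc test.
import numpy as np
from scipy import stats
import scikit_posthocs as sp

rng = np.random.default_rng(0)
scores = rng.uniform(40, 100, size=(60, 3))  # placeholder: columns = ChatGPT, Gemini, Llama

# Shapiro-Wilk normality test per model.
for name, column in zip(["ChatGPT", "Gemini", "Llama"], scores.T):
    print(name, "Shapiro-Wilk p =", stats.shapiro(column).pvalue)

# Non-parametric Friedman test (repeated measures over the same 60 questions).
statistic, p_value = stats.friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print("Friedman p =", p_value)

# Nemenyi post-hoc pairwise comparisons.
print(sp.posthoc_nemenyi_friedman(scores))
```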
Figure 5 represents the score distribution of the three LLM models in the Maritime English test. It visualizes the distribution of scores across 60 test questions, displaying the median, the interquartile range between the 25th and 75th percentiles, and outliers for each model. As seen in the graph, ChatGPT's scores are mostly positioned in the higher range, although some outliers appear in the lower score range. Gemini, while having a lower median than ChatGPT, still shows a score distribution concentrated in the upper range. In contrast, Llama has a comparatively lower median, with greater variability in scores, displaying a wider range of distribution patterns.
3.2 Analysis of Scoring Penalty Factors of LLM Models
The errors made by the LLM models can be classified into three main types. The first type is errors caused by literal translation of sentences instead of the idiomatic phrases commonly used in practice, which disrupts the flow of communication. A representative example occurred when, responding to the phrase ‘본선은 항행 중이다’, the LLM models answered ‘This vessel is not underway’. In Korean maritime communication, ‘본선’ (own vessel) or ‘귀선’ (your vessel) is used instead of ‘나’ (I) or ‘너’ (you), but the LLM models translated ‘본선’ literally as ‘This vessel’ rather than using the first-person form expected in SMCP, leading to point deductions.
The second type of error occurs when the model either fails to provide an answer or generates an irrelevant response outside the given context. For instance, when asked ‘Is the vessel seaworthy?’ (선박이 감항성이 있는가?), the model incorrectly responded with ‘Is the vessel stable?’ (선박이 안정성이 있는가?). While the two sentences may appear similar, ‘stability’ (안정성) and ‘seaworthiness’ (감항성) are distinct concepts in maritime communication, which resulted in a deduction.
The third type of error involves using non-SMCP, non-specialized vocabulary. In the phrase ‘예인줄이 본선의 프로펠러에 감겼다’ (the towing line is wound around my propeller), the appropriate SMCP term for ‘감겼다’ is ‘foul’, yet the LLM models used ‘entangle’, a more general term rather than the precise technical terminology required by SMCP.
Although the models demonstrated a generally high accuracy rate, some sentences still showed low conformity to SMCP or the models failed to fully grasp idiomatic expressions.
Additionally, there were instances where the models either did not generate a response or produced responses containing errors, indicating that direct use in practice or educational settings without proper guidance requires caution.
3.3 Comparative Analysis of Maritime English Proficiency Between Junior Officers and LLM Models
To determine the appropriate minimum Maritime English proficiency score for junior officers completing maritime training, consultations were conducted with seven maritime industry experts (holding at least a 2nd-class license and with over seven years of experience). The results indicated that a minimum score of 1,139 out of 2,000 is generally required for a candidate to be considered suitable for onboard service. This translates to an average score of at least 56.95 per question. When compared to this standard, the LLM models met the required score threshold based purely on numerical results. However, as previously discussed, these scores alone do not guarantee practical applicability. The responses generated by the LLM models contained errors that would be unsuitable for real-world maritime operations, indicating that they are not yet fully appropriate for direct use in practical or educational settings. Further research and analysis will be necessary to assess their actual applicability in maritime training and operations.
3.4 BERT-Based Semantic Similarity Analysis
Lastly, the research employed a BERT-based transformer model to evaluate the semantic similarity between LLM-generated responses and the correct SMCP phrases. This approach enables comparison not only of the structural composition of sentences but also of their underlying semantic meaning, which can be utilized in future competency tests assessing students’ SMCP usage. BERT, originally developed by Google Research, is designed to capture contextualized semantic meaning within a language. Unlike character-based similarity measures such as Similar_Text, BERT-based models encode sentences into a high-dimensional vector space and calculate similarity based on cosine similarity. This makes it possible to check whether two sentences have similar meanings even if they are written differently, which is useful when applied to future testing systems. Equation (2) illustrates how BERT-based sentence transformers compare the meaning of two sentences based on cosine similarity.
cos(A, B) = (A ∙ B) / (‖A‖ × ‖B‖) = ( Σ A_i × B_i ) / ( √(Σ A_i²) × √(Σ B_i²) ), with the sums taken over i = 1 to n   (2)

where:
- A and B = sentence embedding vectors
- A ∙ B = dot product in n-dimensional space
- i = index of the vector component
- n = dimensionality of the sentence embedding vector space
Sentence-transformer models derived from BERT are lightweight and well suited to sentence embedding and semantic similarity tasks. Among them, all-MiniLM-L6-v2, a distilled version of BERT, is especially fast and compact while maintaining its core functionality (Yin and Zhang, 2024). Since the model is hosted on Google Cloud to process input texts from students and users, it is essential to keep the model lightweight, efficient, affordable, and accurate. For this purpose, all-MiniLM-L6-v2 was chosen for semantic analysis and real-time processing. Table 4 shows the score comparison between the two scoring methods (Similar_Text and all-MiniLM-L6-v2).
The all-MiniLM-L6-v2 similarity scores for the three LLM models are 87.77 for ChatGPT, 79.50 for Gemini, and 58.65 for Llama. When compared to the results from the Similar_Text algorithm, the trend remains consistent. However, a clear distinction exists between the two scoring methods. For example, as shown in Figure 6, for the phrase ‘I am not underway’, when the input is ‘I am not under way’, the score changes by more than 30 points. Even though the sentence structures are nearly identical, the meanings of ‘underway’ (a vessel that is not at anchor, made fast to the shore, or aground) and ‘under way’ (making progress, in general usage) differ. This semantic difference leads to the variation in similarity scores.
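As an illustration, the semantic comparison described in this section can be sketched with the sentence-transformers library and the all-MiniLM-L6-v2 model named above. This is a minimal sketch under assumed inputs; the example sentences come from the discussion, and the exact scores it produces will differ from the values reported here.

```python
# Minimal sketch of BERT-based semantic scoring with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "I am not under way."
candidates = ["I am not underway.", "This vessel is not underway."]

# Encode the reference and each candidate, then compare with cosine similarity (equation 2).
reference_vector = model.encode(reference, convert_to_tensor=True)
for candidate in candidates:
    candidate_vector = model.encode(candidate, convert_to_tensor=True)
    cosine = util.cos_sim(reference_vector, candidate_vector).item()  # value in [-1, 1]
    print(f"{candidate!r}: {cosine * 100:.1f}")
```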
4. Conclusion
An evaluation of commercially available LLM models was conducted to assess their ability to use Maritime English through an online testing system designed for Maritime English education. The evaluation included sentence-based questions, vocabulary questions, and context-based fill-in-the-blank questions. The results showed that ChatGPT achieved an average score of 86.77, followed by Gemini with 82.17, and Llama with 62.64.
When analyzing scores by question type, vocabulary-based questions, where answers were more clearly defined, yielded higher scores than sentence-construction questions, which required the generation of standardized phrases. Statistical analysis of the scores indicated that there was no significant difference between ChatGPT and Gemini, but both models significantly outperformed Llama. Additionally, all LLM models scored higher than students, with online server-based models (ChatGPT, Gemini) performing better than the offline model (Llama).
However, LLM models exhibited some errors in using specialized idiomatic expressions and maritime terminology. While they demonstrated a generally high accuracy rate, some responses showed lower conformity to SMCP or a lack of understanding of idiomatic expressions. In particular, misinterpretations or response errors were observed when handling terms and expressions used in real-world maritime operations. Given that Maritime English is designed to minimize communication misunderstandings through standardized phrases, the immediate application of LLM models in educational or operational settings requires further review in terms of stability and accuracy.
This study examined the performance evaluation and practical applicability of commercial LLM models in Maritime English usage. Since the analysis was limited to SMCP-based phrases and a small selection of models, future research should incorporate a broader dataset, a wider range of models, and diverse analytical methods to conduct a more in-depth study.