
[Case Study] Creating a Specialized Translation Model for Enterprises Using User Data-Based Fine-Tuning


Overview:

TecAce develops AI services that improve work productivity using AI technology. We use generative AI to build technology that transforms documents into a desired format and summarizes them, and we apply the same technology to translation to provide translation services in professional fields. Through our proprietary technology for evaluating and analyzing the output of generative AI, we have become a professional AI development company that offers reliable services to corporations and specialized fields.


Challenge:

We have recently received numerous translation requests from various clients. Clients expect faster and more efficient results through the use of generative AI. However, after testing ChatGPT, Bard, Naver Clova, and others ourselves, we concluded that while the results are generally satisfactory, they are not suitable for business use. In particular, because incorrect translations in specialized fields can damage a company’s credibility, clients are hesitant to adopt generative AI despite recognizing its potential.


The translation requirements of the client we are currently consulting for can be summarized as follows:


  • Translations must be accurate and in a style suitable for an expert in the field.

  • Professional terminology used within the company must be used as needed.

  • The company’s existing translation style must be maintained. Creating new sentences or words is not allowed.

  • There must be no mistranslations.

The following issues commonly arise when translating with GPT. Even with the latest model, GPT-4, it was difficult to meet the client’s specific requirements (Table 1) without a sophisticated prompt tailored to the given document. For example, when translating the phrase “컴퓨터 한글 문서(.HWP)를…” (“a computer HWP document…”) into English, the model produced the overly literal “Computer Korean document(.HWP)…”. Without a prompt supplying translation guidance for the document, GPT-4 had no way to render phrases such as ‘한글 문서’ in the desired form. In addition, GPT-4 did not automatically break the complex Korean sentences containing multiple clauses, a pattern common in the client’s source documents, into the simple English sentences with pronouns that the client’s reference translations use. Finally, words such as ‘어떤’ can be rendered as either “certain” or “specific” in English; because these carry different nuances, it was not easy to match the client’s preferred translation style.


Table 1. Translation Results Without Fine-Tuning and the Issues Identified

| Category | Client’s Human Translation | Translation Result Without Fine-Tuning (Prompt Not Applied) |
| --- | --- | --- |
| Proper noun: ‘한글 문서(.HWP)’ | HWP Document | Korean Document(.HWP) |
| Complex Korean sentence | Multiple simple English sentences | Complex English sentence |
| Nuance: ‘어떤’ | Certain | Specific |


Solution:

Several approaches were attempted in developing a solution. First, a field-specific translation solution was prepared using GPT-4.0 with prompts. A glossary was built from a large number of example documents, and translation tests were run with prompts instructing the model to translate specific terms accurately. Prompts built with this advanced prompting approach produced translations that were satisfactory to some extent: they were accurate and used professional terminology. However, the results were not delivered in the style the client desired (short sentences, the right nuances, and so on).
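As an illustration of the prompt-and-glossary approach, the sketch below composes a system prompt that pins domain terms to the client’s preferred English renderings. The function name and the style rules are hypothetical; only the two terms are taken from Table 1 of this case study.

```python
# Sketch of glossary-based prompting (names and rules are illustrative).
# The glossary pins domain terms to preferred English renderings, and the
# resulting string would be sent as the system prompt of a chat completion.

GLOSSARY = {
    "한글 문서(.HWP)": "HWP Document",  # proper noun from Table 1
    "어떤": "certain",                  # preferred nuance from Table 1
}

def build_translation_prompt(glossary: dict[str, str]) -> str:
    """Compose a system prompt enforcing glossary terms and style rules."""
    lines = [
        "You are a professional Korean-to-English translator.",
        "Translate faithfully; do not invent new sentences or words.",
        "Prefer short, simple English sentences.",
        "Always use these exact translations for the following terms:",
    ]
    for source, target in glossary.items():
        lines.append(f"- '{source}' -> '{target}'")
    return "\n".join(lines)

prompt = build_translation_prompt(GLOSSARY)
print(prompt)
```

One drawback this sketch makes visible: every new term adds lines to the prompt, which is exactly the prompt-dilution and management problem noted in the conclusion.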


Second, we applied fine-tuning with user data based on the GPT-3.5 Turbo model. Initially, a dataset of about 100 examples was created from existing translation data, and fine-tuning was attempted on the GPT-3.5 Turbo 4K model. However, the results were not satisfactory: translation tests with the fine-tuned model did not outperform the GPT-4.0 prompt version.


Third, we prepared over 1,000 translation examples, systematically organized and refined the data, and optimally configured the training hyperparameters before fine-tuning the GPT-3.5 Turbo 1106 model. This produced translations closer to the level corporations require than the GPT-4 prompt version. In other words, the issues identified in the GPT-4 translations were resolved, more of the vocabulary from the corporation’s reference translations was used, and the translation style followed the reference more closely.
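A minimal sketch of how one translation pair can be converted into the chat-style JSONL record that GPT-3.5 Turbo fine-tuning expects. The system message and the example pair below are illustrative placeholders, not the client’s actual data.

```python
import json

# One fine-tuning record in the chat format used by GPT-3.5 Turbo fine-tuning:
# a system message fixing the task, the Korean source as the user turn, and
# the client's reference translation as the assistant turn.
def to_finetune_record(source_ko: str, reference_en: str) -> dict:
    return {
        "messages": [
            {"role": "system",
             "content": "Translate Korean into English in the company's style."},
            {"role": "user", "content": source_ko},
            {"role": "assistant", "content": reference_en},
        ]
    }

# Illustrative pair (not actual client data).
record = to_finetune_record("컴퓨터 한글 문서(.HWP)를 엽니다.",
                            "Open the computer HWP Document.")

# Fine-tuning datasets are uploaded as one JSON object per line (JSONL).
jsonl_line = json.dumps(record, ensure_ascii=False)
print(jsonl_line)
```

With over 1,000 such records, the data organization and refinement step described above amounts to making the assistant turns consistent in terminology and style before upload.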


Translation performance across models was compared on 20 documents using three metrics (BLEU, METEOR, and BERTScore); the results are summarized in Table 2. BLEU and METEOR are widely used machine-translation metrics. Both compare sequences of N-grams against a reference; in this case, N was increased up to 4. In both BLEU and METEOR, the fine-tuned models scored significantly higher than GPT-4, which was not trained on the client’s data. METEOR is similar to BLEU but also takes synonyms and word forms into account, and the fine-tuned models performed slightly better on it as well. The statistical differences between the models are summarized in the box plots in Figure 1.

BERTScore is based on a learned model and partially considers context. Here the fine-tuned models achieved a higher precision score than recall score, meaning they produce translations faithful to the source text and the reference translations while adding little extraneous content. Their recall was slightly lower; on recall, the larger GPT models, trained on more data and thus better able to cover all N-grams in the reference, were superior. This suggests that increasing the amount of training data could improve recall in both cases, and the recall of the fine-tuned models is expected to improve as they are trained on more user data. In conclusion, a simple average of these metrics shows that GPT-3.5 Finetune-1106, which was trained with the larger dataset, has the highest overall score.
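As a sketch of how such a comparison can be scored, the following minimal single-reference BLEU implementation computes clipped n-gram precisions up to N = 4, a geometric mean, and a brevity penalty. This is a simplified sentence-level variant for illustration, not the exact metric implementation behind Table 2.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Minimal single-reference BLEU: clipped n-gram precisions up to
    max_n, combined by geometric mean, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # a zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Production evaluations typically use a library implementation with smoothing for short sentences; the sketch above only shows the mechanics of the N-gram comparison described in the text.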


Table 2. Comparison of Translation Performance by Model


  1. Comparison Using Boxplots of BLEU Scores


  2. Comparison Using Boxplots of METEOR Scores


Figure 1. Comparison of box plots for BLEU and METEOR scores


Conclusion:

Using GPT-4.0 with various glossaries and prompts can be effective, but as the glossary grows, it leads to management overhead and prompt-dilution issues. Fine-tuning with user data, as attempted here, showed the following advantages:


  • The GPT-3.5 fine-tuned model matched the translation style companies desired better than the GPT-4.0 prompt version, which tended to add words not present in the original text.

  • The GPT-3.5 fine-tuned model greatly reduced token usage costs, cutting them by roughly a factor of eight.

  • The GPT-3.5 version also made translation significantly faster than the GPT-4.0 prompt version.
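The cost advantage can be sketched with simple arithmetic. All prices and token counts below are hypothetical placeholders, not actual OpenAI pricing or the client’s real usage, and the resulting ratio is not calibrated to the roughly eightfold saving reported above; the point is that the prompt version pays both a higher per-token rate and a long glossary prompt on every request.

```python
# Illustrative cost comparison (all prices and token counts are hypothetical
# placeholders, not actual OpenAI pricing). The prompt version pays twice:
# a higher per-token rate AND a glossary/system prompt sent with every request.

def request_cost(prompt_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    return (prompt_tokens / 1000) * price_in_per_1k + \
           (output_tokens / 1000) * price_out_per_1k

DOC_TOKENS = 1000       # source document
OUTPUT_TOKENS = 1000    # translated output
GLOSSARY_TOKENS = 1500  # glossary + instructions attached to every request

# Hypothetical per-1K-token prices for each setup.
gpt4_cost = request_cost(DOC_TOKENS + GLOSSARY_TOKENS, OUTPUT_TOKENS, 0.03, 0.06)
ft_cost = request_cost(DOC_TOKENS, OUTPUT_TOKENS, 0.003, 0.006)  # no glossary

print(f"prompt version: ${gpt4_cost:.3f}, fine-tuned: ${ft_cost:.3f}, "
      f"ratio: {gpt4_cost / ft_cost:.1f}x")
```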


While the GPT-4.0 prompt version had no significant issues with context or translation quality, fine-tuned models based on user data proved much more effective for specific domains and purposes. However, fine-tuning requires essential skills and experience, such as specialized data preprocessing and optimal hyperparameter selection for training.
