
[DeepTecTok #2] Model Improvement through Fine-tuning of Translation-specialized LLM

The following text has been translated from Korean to English using AssistAce.


Sungjin (James) Kim, Ph.D. | LinkedIn | YouTube


Case Study of Translation LLM Fine-Tuning


Introduction


The technology of Large Language Models (LLMs) is advancing rapidly. There are various ways to utilize LLMs, including prompting, embedding, and fine-tuning. In this article, we will focus on fine-tuning, which requires a significant amount of GPU computing resources.


While it is possible to secure high-performance computing resources on local devices, using a cloud environment has the advantage of reducing complexity. Accordingly, there are many specialized cloud services that make AI training and inference more convenient. Vessl is one such service, built around not only LLMs but generative AI more broadly [1]. In this article, we explore a case study of fine-tuning M2M100, one such LLM, using the Vessl hub [2,3].


M2M100 is a transformer-based LLM developed by Meta. It consists of an encoder and a decoder and supports translation among 100 languages, enabling many-to-many multilingual translation.



Creating an AI Development Environment with Vessl


The instructions for using Vessl are explained on the relevant website [1]. You can access Vessl through the Vessl Hub and use it with the Vessl command-line tool. In this guide, we will focus on how to fine-tune translation models using Vessl, rather than the basic usage of Vessl.



Vessl Hub provides support for Jupyter as the default coding environment. Jupyter is a popular coding environment among AI developers. To create an AI environment with Jupyter, you can use the following commands:


$ poetry shell
$ cd vessl_use
$ vessl run create -f jupyter-notebook.yaml

Since Vessl is installed as a Python package that provides a command-line tool, we use Poetry to set up an environment in which the vessl command is available. Also, because Poetry keeps user code in a project subfolder, we change into the vessl_use folder. Vessl commands are executed with the run subcommand; with create, an environment on Vessl's cloud service is created according to the instructions in the given YAML file.


The following is a sample YAML file for Vessl that sets up an environment in which Jupyter can be used interactively. It is provided as one of the usage examples on the Vessl website.


name: gpu-interactive-run
description: Run an interactive GPU-backed Jupyter and SSH server.
tags:
  - interactive
  - jupyter
  - ssh
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small-spot
image: quay.io/vessl-ai/torch:2.1.0-cuda12.2-r3
interactive:
  max_runtime: 8h
  jupyter:
    idle_timeout: 120m

In the resources section, the cluster is vessl-gcp-oregon, and the preset gpu-l4-small-spot specifies the machine type. The image is one provided by Vessl, based on NVIDIA CUDA 12.2 and PyTorch 2.1.0. For the interactive environment, the maximum runtime is set to 8 hours, and the Jupyter idle timeout is set to 120 minutes.


M2M100 Fine-tuning


The M2M100 model was fine-tuned and tested using the following steps.


Step 1. Install Packages


Install the following packages required for fine-tuning.


transformers==4.40.0
sentencepiece==0.2.0
accelerate==0.29.3
datasets==2.19.0

Here, sentencepiece is a package for language-independent subword tokenization and is used by the LLM's tokenizer. accelerate is a Hugging Face package that lets PyTorch training code run on various hardware such as CPUs, GPUs, and TPUs with minimal changes. For example, once the model, optimizer, and data loaders have been prepared through an Accelerator, acceleration is achieved by simply changing loss.backward() to accelerator.backward(loss) in the training step.
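
As a minimal sketch of this pattern (it is not part of the article's fine-tuning code, which uses Seq2SeqTrainer instead, and the training objects are assumed to already exist), a PyTorch training loop wrapped with Accelerate might look like this:


from accelerate import Accelerator

def train_one_epoch(model, optimizer, dataloader):
    """Minimal Accelerate sketch; model, optimizer, and dataloader are assumed to exist."""
    accelerator = Accelerator()  # detects the available device(s) automatically
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    model.train()
    for batch in dataloader:
        outputs = model(**batch)    # Hugging Face models return a loss when labels are provided
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()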


Step 2. Import Packages


Import the necessary packages for fine-tuning.


import json

from transformers import (M2M100ForConditionalGeneration, M2M100Tokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)
from datasets import Dataset

The json package is imported to handle data loading.


From Hugging Face's transformers package, import four classes, including M2M100ForConditionalGeneration and M2M100Tokenizer, which are designed for the M2M100 translation model. The M2M100Tokenizer class is based on the sentencepiece package and can tokenize more than 100 languages. By contrast, for commonly used decoder-only LLMs such as Llama 2, the model and tokenizer can be loaded through the automated AutoModelForCausalLM and AutoTokenizer classes.
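
For comparison, the two loading styles look roughly as follows. This snippet is illustrative only; the Llama 2 repository name is an assumption, and that checkpoint is gated behind a license agreement on Hugging Face.


from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          M2M100ForConditionalGeneration, M2M100Tokenizer)

# Model-specific classes for the encoder-decoder M2M100 translation model
m2m_model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
m2m_tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Auto classes for a decoder-only LLM such as Llama 2 (repository name assumed; access is gated)
llama_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")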


Import the Dataset class from the datasets package, which allows data-related processing.


Step 3. Create a Class for M2M Translation


The M2M translation class is initialized with the 418M-parameter m2m100 model and its tokenizer, and defines the related member functions.


class M2M:
    """M2M100"""
    def __init__(self, ft_fold, src_lang:str="ko", tgt_lang:str="en"):
        """M2M100 model and tokenizer initialization"""
        self.model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
        self.tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        self.tokenizer.src_lang = src_lang
        self.tokenizer.tgt_lang = tgt_lang

    def trans(self, input_text:str):
        """Translate input_text from source language to target language"""
        encoded_pt = self.tokenizer(input_text, return_tensors="pt")
        generated_tokens = self.model.generate(**encoded_pt,
            forced_bos_token_id=self.tokenizer.get_lang_id(self.tgt_lang))
        output = self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        return output[0]

    def print_trans(self, d_source_list: list[str]):
        """Print translation of source text"""
        for idx, input_text in enumerate(d_source_list):
            output_text = self.trans(input_text)
            print(f"{idx+1}. {input_text} ==> {output_text}")

    def encode(self, data):
        # Assuming tokenizer is already initialized and configured
        inputs = self.tokenizer(data['source'], padding="max_length", truncation=True, max_length=128)
        outputs = self.tokenizer(data['target'], padding="max_length", truncation=True, max_length=128)
        return {
            'input_ids': inputs['input_ids'],
            'attention_mask': inputs['attention_mask'],
            'labels': outputs['input_ids']  # using output input_ids as labels for training
        }

    def tokenize(self, data_dict: dict):
        raw_datasets = Dataset.from_dict(data_dict)
        split_datasets = raw_datasets.train_test_split(test_size=0.2)  # 80% train, 20% test
        tokenized_datasets = split_datasets.map(self.encode, batched=True)
        return tokenized_datasets

First, during the initialization process of the class, the AI model and tokenizer required for M2M translation are loaded. Then, member functions for performing translation and displaying translation results are created as trans() and print_trans() respectively.


For fine-tuning, the tokenize() member function performs tokenization using the encode() function defined above. In tokenize(), the data is split into 80% for training and 20% for testing. In encode(), the maximum input and output lengths are limited to 128 tokens. Although the M2M100 model supports input and output sequences of up to 1024 tokens, a shorter limit is used here to reduce computation and response time; shorter sequences also tend to translate better, since very long inputs can strain the attention mechanism. However, translating a long document sentence by sentence in this way may lead to inconsistency in context or word choice across the translated sentences. Various approaches can be considered to address this, such as translating with overlapping context, applying a consistency pass after translation, or using a terminology glossary.
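
As one illustration of the overlapping-context idea, the sketch below groups sentences into chunks that share a sentence with the previous chunk before each chunk is translated. The helper and its chunk sizes are hypothetical and not part of the article's code.


def split_with_overlap(sentences: list[str], chunk_size: int = 4, overlap: int = 1):
    """Group sentences into chunks that overlap by `overlap` sentences,
    so each chunk carries some context from the previous one."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(sentences), step):
        chunk = sentences[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(sentences):
            break
    return chunks

# Hypothetical usage with the M2M class from Step 3:
# for chunk in split_with_overlap(long_document_sentences):
#     print(m2m.trans(chunk))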


A standalone trans() function is also created so that, after fine-tuning, translation can be performed with the newly trained model.


def trans(model, tokenizer, input_text:str, src_lang:str="ko", tgt_lang:str="en"):
    """Translate input_text from source language to target language"""
    tokenizer.src_lang = src_lang
    encoded_pt = tokenizer(input_text, return_tensors="pt")
    generated_tokens = model.generate(**encoded_pt,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    output = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    return output[0]
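
For example, assuming the fine-tuned model and tokenizer have been loaded from the output folder as in Step 7 below, the same helper can translate in the opposite direction simply by swapping the language codes (the English sentence here is only an illustration):


print(trans(model, tokenizer, "안녕"))                                                # Korean -> English
print(trans(model, tokenizer, "How can I help you?", src_lang="en", tgt_lang="ko"))  # English -> Korean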

Step 4. Loading Data for Fine-tuning


Create functions to load the necessary data for fine-tuning and print the contents of the data.


def load_data(data_file:str = 'ft_data/data_m2m.json'):
    """Load data from data_m2m.json"""
    with open(data_file, "r", encoding="utf-8") as f:
        data_dict = json.load(f)
    return data_dict

def print_data_dict(data_dict: dict):
    for idx, (input_text, output_text) in enumerate(zip(data_dict["source"], data_dict["target"])):
        print(f"{idx}. {input_text} ==> {output_text}")

The load_data function above loads the data in JSON format. The print_data_dict function prints the contents of the loaded data.


The training data for fine-tuning consists of sentences in the original language and their translations, which are stored in the "source" and "target" keys of the data dictionary, respectively. Here is an example of the data:


{"source":["안녕", "안녕, 나는 인공지능 로봇이야.", "어떻게 도와줄까?", "나는 인공지능 로봇이야, 너를 도와주기 위해 여기 있어."],
"target":["Hi", "Hi, I am a AI robot.", "How can I help you?", "I am a AI robot, I am here to help you."]}

The training data is designed with the following intentions: since "안녕" (annyeong) is a casual, friendly greeting in Korean, it should be translated as "Hi"; and "인공지능 로봇" (ingongjineung robot) is treated as a newly coined term, so it is expressed as "AI robot".
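
As a minimal sketch for preparing this file (it assumes the ft_data folder sits next to the script, matching the default path in load_data()), the example above can be written to disk like this:


import json
import os

data_dict = {
    "source": ["안녕", "안녕, 나는 인공지능 로봇이야.", "어떻게 도와줄까?",
               "나는 인공지능 로봇이야, 너를 도와주기 위해 여기 있어."],
    "target": ["Hi", "Hi, I am a AI robot.", "How can I help you?",
               "I am a AI robot, I am here to help you."],
}

os.makedirs("ft_data", exist_ok=True)
with open("ft_data/data_m2m.json", "w", encoding="utf-8") as f:
    json.dump(data_dict, f, ensure_ascii=False, indent=2)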


Step 5. Creating the Fine-tuning Function


To perform fine-tuning, we need to set the necessary parameters and start the training process.


def finetune(m2m:M2M, data_dict:dict, ft_fold:str):
    tokenized_datasets = m2m.tokenize(data_dict)

    training_args = Seq2SeqTrainingArguments(
        output_dir=ft_fold,           # Where to store the final model
        evaluation_strategy='epoch',      # Evaluation is done at the end of each epoch
        learning_rate=5e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        weight_decay=0.01,
        save_total_limit=3,
        num_train_epochs=3,
        predict_with_generate=True
    )

    trainer = Seq2SeqTrainer(
        model=m2m.model,
        args=training_args,
        train_dataset=tokenized_datasets['train'],
        eval_dataset=tokenized_datasets['test'],
        tokenizer=m2m.tokenizer
    )

    trainer.train()
    trainer.save_model(ft_fold)

The learning rate is set to 5e-5 and the number of training epochs to 3. The tokenized data provides the training and evaluation splits, a Seq2SeqTrainer instance is created, training is performed, and the resulting model is saved in the specified folder.


Step 6. Performing Fine-tuning


The following steps are taken to perform fine-tuning using the classes and functions created so far.


"""Main function for m2m_ft_ahnlab.py"""
FT_FOLD = './ft_fold'
m2m = M2M(FT_FOLD)

data_dict = load_data()

print("\\nBefore finetuning")
m2m.print_trans(data_dict["source"])

print("\\nTraining data")
print_data_dict(data_dict)

finetune(m2m, data_dict, FT_FOLD)

First, an instance of the M2M class is created, and the model's translations are checked before fine-tuning. The training data is then printed, and fine-tuning is performed.


Step 7. Verifying the Results Using the Fine-tuned Model


The trained model and tokenizer are loaded from the folder where the training results are stored, and they are used to see how the translations are performed.


# %% Load model
model_dir = FT_FOLD # Adjust this to your specified output_dir
# Load the trained model
model = M2M100ForConditionalGeneration.from_pretrained(model_dir)
tokenizer = M2M100Tokenizer.from_pretrained(model_dir)

print("After Fine-tuning")
for input_text in data_dict["source"]:
    output_text = trans(model, tokenizer, input_text)
    print(f"{input_text} ==> {output_text}")

Before fine-tuning, the pretrained model translated "안녕" as "Hello"; after fine-tuning, the desired translation "Hi" is observed.



Key Points


In this article, we explored how fine-tuning can adapt a translation-specialized LLM (Large Language Model) to produce desired terms and sentence patterns. Fine-tuning can be applied not only to translation models but also to other LLMs to adapt them to specific tasks. Although we did not delve into the specifics in this article, it is important to note that optimal results depend on carefully preparing the dataset and the fine-tuning parameters.


TecAce is making various efforts to respond to customer needs with artificial intelligence, including generative AI such as LLMs, through its in-house expertise and key collaborations. In particular, we are preparing to provide a full stack for businesses adopting artificial intelligence, spanning AI software tools such as AssistAce and ResourceAce, GPU hardware, and cloud services. Going forward, we will continue to strive to increase customer satisfaction in the field of artificial intelligence technology and transformation.


References


