
[DeepTecTok #1] AI LLM Structure and Use of Transformer

Updated: Oct 3


Sungjin (James) Kim, Ph.D. | LinkedIn



Introduction


Generative AI (GenAI), including Large Language Models (LLMs), has seen significant advancements since the introduction of the transformer architecture [1]. Introduced in the 2017 Google paper "Attention Is All You Need," the transformer marks a departure from recurrence-based sequence-to-sequence (Seq2Seq) models, centering on an 'attention' mechanism that processes entire sequences in one go [1,2,3]. GPT subsequently adopted a structure that keeps only the decoder, removing the encoder from the original transformer, and demonstrated superior performance in question answering and text generation. This has led to numerous LLMs that use only the decoder [4]. However, encoder-only structures and the original transformer structure with both an encoder and a decoder are also continuously being developed. Understanding the advantages and disadvantages of these different structures is an important element in grasping the trends in LLM technology. This article examines the transformer structure adopted by most LLMs and introduces the various ways it is utilized.


The Structure and Operating Principle of the Transformer


The structure proposed in the Transformer paper is shown in Figure 1. The overall operation of the Transformer proceeds as follows. First, the input passes through the encoder on the left, block by block, and is transformed into the final encoded representation. This final encoded information is fed into every individual decoder block. Next, the decoder generates output tokens using the encoded information together with the tokens decoded so far. The output of the last of the N decoder blocks is converted into output tokens by a linear layer followed by softmax processing, completing one step of the Transformer operation. Each generated output token is then fed back into the decoder for the next step, in an autoregressive manner.



Figure 1. The Architecture of the Transformer Network

Image Source: [3]
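
This end-to-end flow can be sketched with PyTorch's built-in nn.Transformer module. The sketch below is a simplification, with random placeholder vectors standing in for real token embeddings and positional encodings, and the final linear layer and softmax omitted; it only illustrates how the encoder output is reused by the decoder at every autoregressive step.

import torch
import torch.nn as nn

# A minimal sketch of the overall encoder/decoder flow (not a full model):
# embedding, positional encoding, and output projection are omitted, and
# random vectors stand in for real token representations.
d_model, nhead, N = 512, 8, 6
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=N, num_decoder_layers=N)

src = torch.rand(10, 1, d_model)        # 10 input "tokens", batch size 1
memory = model.encoder(src)             # final encoded information, shared by all decoder blocks

tgt = torch.rand(1, 1, d_model)         # start-token vector
for _ in range(5):                      # autoregressive loop: feed outputs back in
    tgt_mask = model.generate_square_subsequent_mask(tgt.size(0))
    out = model.decoder(tgt, memory, tgt_mask=tgt_mask)
    tgt = torch.cat([tgt, out[-1:]], dim=0)   # append the newest position and decode again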


Encoder Operation


The encoder section is implemented in three main parts: token embedding, positional encoding, and N stacked encoding blocks, through which the input information passes sequentially to be transformed into encoded information. After all N encoding blocks have been processed, the final result is sent identically to every decoder block.


Token Embedding


The token embedding part transforms each input token into a vector while taking the relationships between tokens into account. For example, the vector for the token "son" would end up more similar to the vector for "prince" than to the vector for "princess." By encoding these interrelationships, embedding conveys far more meaningful information than simple token indices arranged in sequence could.
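
As a rough illustration, in PyTorch a token embedding is just a learned lookup table. The vocabulary size, dimension, and token indices below are made-up values, and the vectors only become meaningful after training.

import torch
import torch.nn as nn

# A learned lookup table that maps each token index to a dense vector.
vocab_size, d_model = 30000, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[101, 2052, 7, 42]])   # hypothetical token indices
token_vectors = embedding(token_ids)             # shape: (1, 4, 512)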


Positional Encoding


Because the Transformer structure does not inherently account for the order of input tokens, information about the sequence of tokens can be lost. To prevent this, positional encoding, which adds location information to the input data, is performed after the tokens are transformed into embedding vectors. Each token is already represented by a vector of the same size as the embedding dimension, so each element of that vector can be identified by a 2D tuple (pos, i), where pos is the token's position in the sequence and i is the index within the embedding dimension. The value derived by Formula 1 from this 2D position is then added at each position.


PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Formula 1. The formula for calculating positional encoding values - pos: token index, i: embedding dimension index
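
A short sketch of Formula 1 in PyTorch, assuming illustrative max_len and d_model values and an even embedding dimension:

import torch

# Sinusoidal positional encoding following Formula 1:
# even embedding indices use sine, odd indices use cosine.
def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # token index
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even embedding dimension index
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# The encoding is simply added to the embedding vectors:
# x = token_vectors + positional_encoding(seq_len, d_model)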


Encoding Process


The encoding process consists of four stages, listed in the order they are processed: multi-head attention, first summation and normalization, position-wise feedforward neural net, and second summation and normalization.
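
To make the four stages concrete, here is a minimal, simplified encoder-block sketch in PyTorch (illustrative dimensions; dropout and padding masks omitted):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1) multi-head self-attention, 2) first summation and normalization
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # 3) position-wise feedforward net, 4) second summation and normalization
        return self.norm2(x + self.ffn(x))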


Encoder Multi-head Attention


Multi-head attention is a processing method that performs several attention operations in parallel on the same input and then combines those results. In the encoder, attention uses the input token vectors as query, key, and value vectors.

Self-attention is designed to make encoding or decoding more effective by considering the relevance between neighboring token vectors. First, vectors for each input token are transformed into sets of query, key, and value vectors through three different matrix transformations. Then, the output for each input token is generated by calculating an attention weight vector corresponding to the similarity with surrounding tokens and then combining the value vectors of all tokens, considering these weights.

While self-attention is a key feature of the Transformer and contributes significantly to its performance differentiation, it has a higher computational complexity than traditional CNNs or RNNs.

After attention processing is complete in each head, the multi-head attention concatenates the outputs of all heads and passes them through a linear projection to produce the final output.
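
The operation each head performs is scaled dot-product attention; a minimal sketch follows (shapes and the optional mask argument are illustrative):

import math
import torch

# Scaled dot-product attention for one head.
# q, k, v: tensors of shape (..., seq_len, d_k); mask: optional boolean tensor
# where True marks positions that must not be attended to.
def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # similarity between queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # attention weights
    return weights @ v                                  # weighted combination of value vectors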


Summation and Normalization


The summation and normalization stages in the encoder and decoder operate in the same way. This stage takes the output produced by the previous sub-layer and adds it to that sub-layer's input through a residual sum, and the result is then normalized to keep its scale under control. This residual approach, originating from ResNet, helps the gradients needed for learning flow effectively.


Position-wise Feedforward Neural Net


The position-wise feedforward neural net is a small neural network applied to each token's vector individually; the same weights are used at every position, regardless of where the token appears in the sequence.
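
A quick sketch of this position-wise property, using illustrative dimensions: applying the same two-layer network to a (batch, seq_len, d_model) tensor reuses one set of weights across all token positions.

import torch
import torch.nn as nn

# The same two linear layers act on every token position independently.
d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.rand(1, 10, d_model)   # 10 token positions
y = ffn(x)                       # identical weights applied at each of the 10 positions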


Decoder Operation


The decoder section also begins with embedding and positional encoding, followed by the decoding blocks. Here, the tokens fed into the embedding are the output tokens generated so far; during training, however, the target tokens are used instead.


Decoding Process


The decoding process is composed of six stages. The information is processed in the following order: masked multi-head attention, first summation and normalization, multi-head attention, second summation and normalization, position-wise feedforward neural net, and third summation and normalization.
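
Analogous to the encoder sketch above, a minimal decoder block with the six stages might look as follows (illustrative dimensions; dropout and padding masks omitted, and memory denotes the final encoder output):

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.masked_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, causal_mask):
        # 1) masked multi-head attention, 2) first summation and normalization
        a, _ = self.masked_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # 3) multi-head attention over the encoder output (keys/values = memory),
        # 4) second summation and normalization
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + a)
        # 5) position-wise feedforward net, 6) third summation and normalization
        return self.norm3(x + self.ffn(x))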


Masked Multi-head Attention


Masked multi-head attention performs attention operations so that vectors corresponding to future tokens in the input sequence are excluded from the attention process through a mask. Therefore, only the output token vectors generated prior to the current time step undergo attention processing.
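
The mask is typically a causal (upper-triangular) mask; a small sketch with an illustrative sequence length of 5, where True marks blocked positions:

import torch

# Position t may only attend to positions 0..t; entries marked True are
# excluded from attention (set to -inf before the softmax).
seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
# row 0: [False,  True,  True,  True,  True]
# row 1: [False, False,  True,  True,  True]  ... and so on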


Decoder Multi-head Attention


This operates in the same functional manner as the encoder multi-head attention. The difference is that the key and value vectors are taken from the final output of the encoder, while the query vectors come from the output of the first summation and normalization step of the decoder block.


Summation and Normalization


The second summation and normalization step, which follows the decoder multi-head attention, adds the result of that attention to the content that served as its query input. The operation itself is identical across the two summation-and-normalization stages of the encoder and the three of the decoder.


Types and Utilization of Transformer Modes


The Transformer has been developed in three forms: its original form with both an encoder and a decoder, a form with only an encoder, and a form with only a decoder. Recent prominent models such as the GPT series and the Llama family take the decoder-only approach. This section examines the characteristics, pros and cons, and representative models of these three approaches.


Encoder-Only Approach


The encoder-only approach is primarily used for understanding and judging input information through encoding. It is therefore efficient for tasks such as information classification, named entity recognition, and part-of-speech tagging. Pre-training typically involves masking parts of input sentences and predicting the removed tokens. Models like BERT and DistilBERT are prominent examples.


  • Advantages: Skilled at understanding the context of input information.

  • Disadvantages: Inefficient for generation tasks due to the absence of a decoding part.


Decoder-Only Approach


This approach is used to generate the desired form of output from the received information, making it suitable for text generation, story creation, and dialogue systems. Because training is more accessible and the structure less complex than encoder-decoder methods, it has recently been developed to perform well across many domains. Pre-training involves predicting the next word or sentence. The GPT series is a notable example of this approach.


  • Advantages: Suitable for generating consistent and contextually appropriate text.

  • Disadvantages: Not efficient in understanding the subtle differences in input information.


Encoder-Decoder Approach


Featuring both an encoder and a decoder, this approach combines the characteristics of both. Its application areas are diverse, including translation, summarization, and question answering. Pre-training can use both objectives: reconstructing removed parts of a sentence and predicting the next phrase. Models like BART and T5 are prominent in this method.


  • Advantages: Effective in understanding input data and generating the desired output.

  • Disadvantages: High complexity and computational load, since it contains two distinct parts with different characteristics. Although a wide variety of tasks is possible, training is challenging and time-consuming because many different task types must be considered for effective learning.
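
For reference, the three approaches map onto familiar model families; a brief sketch of loading one representative of each with the Hugging Face transformers library (the model names are examples only):

from transformers import (AutoModelForSequenceClassification,
                          AutoModelForCausalLM, AutoModelForSeq2SeqLM)

# Encoder-only: BERT-style model, suited to classification-type understanding tasks
encoder_only = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Decoder-only: GPT-style model, suited to text generation
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: T5-style model, suited to translation and summarization
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")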


Latest Methods


Previously, the competition for higher language-processing capability centered on modifications of the Transformer and algorithmic improvements, such as increasing the number of encoding and decoding blocks or the amount of data used for pre-training. Pre-trained models have also evolved to be fine-tuned so that they perform the desired tasks without producing bizarre responses, using not only supervised learning but also reinforcement learning to enhance performance.


Recently, efforts have been made to rein in the growing complexity that comes with these performance improvements. The FlashAttention method improves calculation speed by reorganizing the attention computation so that most of it happens in fast on-chip GPU memory, minimizing data transfers to and from HBM. In addition, the Grouped Query Attention (GQA) and Sliding Window Attention (SWA) algorithms, which reduce the complexity of attention processing while minimizing the sacrifice in performance, have started to be used.
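
As a rough illustration of the sliding-window idea, the mask below restricts each position to itself and the previous two positions (the window size of 3 is an arbitrary example), which keeps the attention cost roughly linear in sequence length rather than quadratic:

import torch

# Sliding-window causal mask: True marks positions excluded from attention.
seq_len, window = 6, 3
i = torch.arange(seq_len).unsqueeze(1)   # query positions
j = torch.arange(seq_len).unsqueeze(0)   # key positions
blocked = (j > i) | (j <= i - window)    # future tokens, or tokens outside the window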


Implications


Artificial Intelligence witnessed a new turning point with the emergence of deep learning technologies represented by CNN and LSTM. Introducing the Transformer structure and attention algorithms further advanced the performance of large language models. Initially, the complexity and difficulty of learning the original Transformer structure, which includes both an encoder and a decoder, were significant. However, the advent of encoder-only or decoder-only methods has made training relatively easier, and the field of application has been gradually expanding. Recently, improvements in the implementation or processing methods of attention algorithms have been made, bringing forth models that maintain good performance while reducing complexity. It is expected that future developments will continue to optimize computational complexity and improve performance.


References





