Transformer Models

Motivating the Transformer Model

  • An RNN has 3 problems stemming from its sequential architecture:

    • Information loss

      • Although GRU/LSTM mitigates the loss, the problem still exists
    • Suffers from the vanishing gradient problem

      • Specifically, the problem caused by longer input sequences
    • Enforces sequential processing of hidden layers

      • Parallelizing RNN computations is nearly impossible
  • The more words we have in our input sequence, the more time it takes to decode the sequence

    • The number of decoding steps is equal to T_{y}
  • Transformers, on the other hand, provide the following benefits:

    • Information loss isn't a problem

      • Since attention scores are computed in a single step
    • Don't suffer from the vanishing gradient problem

      • Specifically, the problem caused by longer input sequences
      • This is because there is only one gradient step
    • Don't enforce sequential processing of hidden layers

      • Parallelizing these computations is much easier
      • Since they don't require sequential computations per layer
      • Meaning, they aren't dependent on previous layers

Describing the Basics of a Transformer

  • A transformer uses an attention mechanism without being an RNN
  • In other words, it is an attention model, but is not an RNN
  • A transformer processes all of the tokens in a sequence simultaneously
  • Then, attention scores are computed between each pair of tokens
  • A transformer doesn't use any recurrent layers
  • Instead, a transformer uses multi-headed attention layers

Introducing the Architecture of a Transformer

    • A transformer consists of 3 basic components:

    • Positional encoding functions
    • Encoder
    • Decoder

[Figure: transformer]

Defining Positional Encoding Functions

  • An input embedding maps a word to a vector
  • However, the same word in a different sentence may carry a new meaning based on its surrounding words:

    • Sentence 1: I love your dog!
    • Sentence 2: You're a lucky dog!
  • This is why we use positional encoding vectors
  • A positional encoding vector adds positional context to embeddings

    • It adds context based on the position of a word in a sentence
PE(pos, 2i) = \sin(\frac{pos}{10000^{\frac{2i}{d_{model}}}})
  • Where the formula has the following variables and constants:

    • pos is the position of a given word in its sentence
    • i is the index of the i^{th} element of the word embedding
    • d_{model} is the size of the word embedding
  • The following is an example of applying the positional encoding function to the 4-dimensional word embedding of dog:
\underbrace{\begin{bmatrix} 0.3 \\ 0.4 \\ 0.7 \\ 0.1 \end{bmatrix}}_{\text{embedding of dog}} + \quad \boxed{\text{positional encoding}} \; = \; \underbrace{\begin{bmatrix} 0.4 \\ 0.6 \\ 0.8 \\ 0.9 \end{bmatrix}}_{\text{positional encoding of dog}}
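  • The following is a minimal numpy sketch of this encoding (the cos form used for odd indices comes from the original paper); the embedding values and the position chosen for dog are made up for illustration:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sin for even indices, cos for odd indices."""
    pos = np.arange(max_len)[:, np.newaxis]                # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # PE(pos, 2i+1)
    return pe

# Hypothetical 4-dimensional embedding of "dog", assumed to sit at position 3
dog_embedding = np.array([0.3, 0.4, 0.7, 0.1])
pe = positional_encoding(max_len=10, d_model=4)
print(dog_embedding + pe[3])   # positionally encoded embedding of "dog"
```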

Initial Steps of the Encoder

  • So far, we've completed the following steps:

    • Embedding our input
    • Performing positional encoding on the embedding
  • Now, we're ready to pass the positionally encoded embeddings into the encoder
  • This involves performing the following steps:

    • Multi-head attention
    • Normalization
    • Feed-forward
  • Before introducing multi-head attention, let's start by defining self-attention

Types of Attention in Neural Networks

  • There are 3 general types of attention:

    • Encoder/decoder attention
    • Causal self-attention
    • Bi-directional self-attention
  • For encoder/decoder attention, a sentence from the decoder looks at another sentence from the encoder
  • For causal attention, words in a sentence (from an encoder or decoder) look at previous words in that same sentence
  • For bi-directional attention, words in a sentence (from an encoder or decoder) look at previous and future words in that same sentence
  • These attention mechanisms are not mutually exclusive
  • Meaning, any attention layer can incorporate any of the 3 types of attention

[Figure: transformer]
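  • As a rough illustration (the sentence lengths are made up), the three types mainly differ in which positions a word is allowed to attend to, which can be expressed as boolean masks:

```python
import numpy as np

L_dec, L_enc = 4, 5   # illustrative decoder / encoder sentence lengths

# Encoder/decoder attention: every decoder word may look at every encoder word
enc_dec_mask = np.ones((L_dec, L_enc), dtype=bool)

# Causal self-attention: each word looks only at itself and previous words
causal_mask = np.tril(np.ones((L_dec, L_dec), dtype=bool))

# Bi-directional self-attention: each word looks at previous and future words
bidirectional_mask = np.ones((L_dec, L_dec), dtype=bool)

print(causal_mask.astype(int))   # lower-triangular pattern
```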

Introducing Self-Attention in the Encoder

  • Attention measures how relevant the i^{th} word in the input sequence is relative to the other words in the input sequence
  • Self-attention allows us to associate it with fruit in the following sentence:
The fruit looks tasty and it looks juicy
  • When calculating attention, there are 3 components:

    • A query matrix Q, consisting of query vectors q_{i}

      • q_{i} represents a vector associated with a single input word
    • A key matrix K, consisting of key vectors k_{i}

      • k_{i} represents a vector associated with a single input word
    • A value matrix V, consisting of value vectors v_{i}

      • v_{i} represents a vector associated with a single input word
  • Specifically, each individual matrix is the result of matrix multiplication between the input embeddings and 3 matrices of trained weights
  • These 3 weight matrices are W_{Q}, W_{K}, and W_{V}
  • Again, q_{i}, k_{i}, and v_{i} refer to the same word of a sequence

    • However, the values associated with q_{i}, k_{i}, and v_{i} are different
    • This is because W_{Q}, W_{K}, and W_{V} are different weight matrices learned during training
  • The formulas for q_{i}, k_{i}, and v_{i} are defined below:
q_{1} = x_{1} \cdot W_{Q} \\ k_{1} = x_{1} \cdot W_{K} \\ v_{1} = x_{1} \cdot W_{V}
  • The example below has the following properties:

    • Translating a sentence with 2 words: hello and world
    • Each word embedding vector has a length of e_{n} = 4

      • Implying, there are 4 words in our vocabulary
    • Each of W_{Q}, W_{K}, and W_{V} is an e_{n} \times m matrix

      • Where m is an adjustable hyperparameter
      • For now, we'll assign m = 3
    • Each q_{i}, k_{i}, and v_{i} has a length of m

[Figure: transformerdotprod]

  • Notice, q_{1} is the result of multiplying x_{1} and W_{Q} together
  • Notice, k_{1} is the result of multiplying x_{1} and W_{K} together
  • Notice, v_{1} is the result of multiplying x_{1} and W_{V} together
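  • The sketch below reproduces this setup in numpy; the embeddings for hello and world and the weight matrices are random placeholders standing in for trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Embeddings for the 2 words "hello" and "world", each of length e_n = 4
X = rng.normal(size=(2, 4))

# Stand-ins for the trained weight matrices, each of shape e_n x m with m = 3
W_Q = rng.normal(size=(4, 3))
W_K = rng.normal(size=(4, 3))
W_V = rng.normal(size=(4, 3))

Q = X @ W_Q   # row i is q_i, a length-3 query vector for word i
K = X @ W_K   # row i is k_i
V = X @ W_V   # row i is v_i
print(Q.shape, K.shape, V.shape)   # (2, 3) (2, 3) (2, 3)
```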

Steps for Calculating Self-Attention in the Encoder

  • At a high level, a self-attention block does:

    1. Start with trained weight matrices W_{Q}, W_{K}, and W_{V}
    2. Compute embeddings to determine how similar X or previous activations are to input words and output words

      • Roughly, Q represents embeddings of output words
      • Roughly, K and V represent embeddings of input words
      • In other words, Q learns patterns from our output sentence
      • And, K and V learn patterns from our input sentence
    3. Convert these similarity scores to probabilities

      • This is because probabilities are easier to understand
  • At a lower level, a self-attention block does:

    1. Receives trained weight matrices W_{Q}, W_{K}, and W_{V}
    2. Computes Q, K, and V:

      • Q = X \cdot W_{Q}
      • K = X \cdot W_{K}
      • V = X \cdot W_{V}
    3. Computes alignment scores q_{i} \cdot k_{j}

      • Here, we're looking for keys k_{j} (input words) that are similar to our query q_{i} (output word)
    4. Divide alignment scores by \sqrt{d_{k}} (the square root of the key vector length) for more stable gradients
    5. Pass values from our previous step through a softmax function

      • This formats each alignment score as a probability
    6. Compute the dot product v_{i} \cdot \text{softmax value} to get the Z matrix

      • Here, Z represents our attention scores
      • Roughly, V represents how similar words from Q and K are
      • Multiplying softmax values by V represents weighting each value v_{j} by the probability that k_{j} matches the query q_{i}
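  • The numbered steps above map onto a short numpy sketch (the input embeddings and weight matrices are random placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # step 2
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # steps 3-4: scaled alignment scores
    A = softmax(scores, axis=-1)                 # step 5: probabilities per query
    return A @ V                                 # step 6: Z, the attention output

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                      # two 4-dimensional word embeddings
W_Q, W_K, W_V = (rng.normal(size=(4, 3)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                                   # (2, 3): one attention vector per word
```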

Understanding Dimensions of Q, K, and V

  • Suppose we're translating an English sentence to a German sentence
  • Q has the following dimensions:

    • Number of rows = number of words in the German sentence, L_{Q}
    • Number of columns = a proposed length D of the embedding
  • K has the following dimensions:

    • Number of rows = number of words in the English sentence, L_{K}
    • Number of columns = a proposed length D of the embedding
  • V has the following dimensions:

    • Number of rows = number of words in the English sentence, L_{K}
    • Number of columns = a proposed length D of the embedding
  • A = \text{softmax}(QK^{T}) has the following dimensions:

    • Number of rows = number of words in the German sentence, L_{Q}
    • Number of columns = number of words in the English sentence, L_{K}
  • Z = AV has the following dimensions:

    • Number of rows = number of words in the German sentence, L_{Q}
    • Number of columns = a proposed length D of the embedding
  • Here, D is an adjustable hyperparameter
  • Mathematically, these matrices are denoted as the following:
Q: L_{Q} \times D \\ K: L_{K} \times D \\ V: L_{K} \times D \\ A: L_{Q} \times L_{K} \\ Z: L_{Q} \times D
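  • A quick shape check with made-up sentence lengths (L_Q = 6, L_K = 8, D = 4) confirms these dimensions:

```python
import numpy as np

L_Q, L_K, D = 6, 8, 4            # German length, English length, embedding size
Q = np.zeros((L_Q, D))
K = np.zeros((L_K, D))
V = np.zeros((L_K, D))

A = Q @ K.T                      # softmax(QK^T) has the same shape as QK^T
Z = A @ V
print(A.shape)                   # (6, 8) -> L_Q x L_K
print(Z.shape)                   # (6, 4) -> L_Q x D
```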

Calculating Self-Attention in the Encoder

  • q_{i}, k_{i}, and v_{i} are simply abstractions that are useful for calculating and thinking about attention

    • However, they are not attention scores themselves
    • Each row of Q, K, and V is associated with an input word
    • Meaning, q_{i}, k_{i}, and v_{i} refer to the i^{th} input word
  • For now, let's focus on calculating the attention scores for the first word x_{1} relative to the other words in the sentence (i.e. x_{2})
  • To calculate attention scores, we just take the dot product q_{i} \cdot k_{j}

    • This determines how relevant word q_{i} is to word k_{j}
  • The following diagram illustrates computing attention scores for:

    • q_{1} and k_{1}
    • q_{1} and k_{2}

[Figure: transformerdotprod]

  • Next, we'll normalize the attention scores by:

    1. Dividing them by \sqrt{d_{k}}

      • This leads to having more stable gradients
    2. Then, passing those outputs through the softmax function

      • The softmax function formats values as probabilities
      • Thus, the output determines how much each word will be expressed for this input word x_{1}
    3. Computing the dot product of the softmax value and each v_{i}

      • This keeps relevant words and drowns out irrelevant words

[Figure: transformerdotprod]

  • The steps can be condensed using matrix multiplication
  • Thus, each row of Z is the attention vector for one word
  • Again, Z represents the attention scores
  • The following computes self-attention on x_{1} and x_{2}:

[Figure: transformerdotprod]

Improving Self-Attention with Multi-Headed Attention

  • The transformer paper refined the self-attention layer by adding a mechanism called multi-headed attention
  • This improves the accuracy and performance of self-attention:

    • Accuracy: Different sets of Q_{h}, K_{h}, and V_{h} can learn different contexts and patterns in the sentence

      • We'll know it refers to fruit here:
      The fruit looks tasty and it looks juicy
    • Performance: Simply splitting Q, K, and V into smaller heads allows us to parallelize matrix multiplication and other computations on these heads
  • Without multi-headed attention, self-attention will always assign the most attention to the word itself

    • Consequently, it becomes useless
    • Which is why we include 8 different heads
    • Then, we'll be able to pick up on more interesting contexts
  • Although additional sets of queries/keys/values are added, they can be processed in parallel with each other
  • The number of sets of queries/keys/values is determined by the number of attention heads

    • This number is an adjustable hyperparameter
    • For future examples, we'll assign this number h = 8, since this is the default number specified in the transformer paper
  • The multi-headed attention mechanism follows the same steps as self-attention
  • However, the multi-headed attention layer makes a few adjustments (sketched in code after the figure below):

    1. Receives trained weight matrices W_{Q}, W_{K}, and W_{V}
    2. Computes h = 8 sets of Q, K, and V:

      • Q = X \cdot W_{Q}
      • K = X \cdot W_{K}
      • V = X \cdot W_{V}
    3. Computes attention scores q_{i} \cdot k_{j}
    4. Divide attention scores by \sqrt{d_{k}} (the square root of the key vector length) for more stable gradients
    5. Pass the values in the previous step through the softmax function
    6. Compute the dot product v_{i} \cdot \text{softmax value} to get the Z matrix
    7. Stack the 8 Z matrices together
    8. Multiply the stacked Z matrix by another trained W_{O} matrix (Z \cdot W_{O}) to get Z_{final}

[Figure: multiattention]
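  • Below is a minimal sketch of those adjustments; it follows the common implementation trick of splitting one large projection into h heads (rather than literally storing 8 separate weight matrices), and the token count and d_model = 16 are made up:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    L, d_model = X.shape
    d_k = d_model // h
    # Steps 1-2: project X, then split the result into h heads of size d_k
    Q = (X @ W_Q).reshape(L, h, d_k).transpose(1, 0, 2)   # (h, L, d_k)
    K = (X @ W_K).reshape(L, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(L, h, d_k).transpose(1, 0, 2)
    # Steps 3-6: scaled dot-product attention, computed for all heads at once
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, L, L)
    Z = softmax(scores, axis=-1) @ V                      # (h, L, d_k)
    # Steps 7-8: stack the h outputs back together and project with W_O
    Z = Z.transpose(1, 0, 2).reshape(L, d_model)
    return Z @ W_O                                        # Z_final

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                              # 5 tokens, d_model = 16
W_Q, W_K, W_V, W_O = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```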

Defining the Masked Multi-Headed Attention Blocks

  • The masked multi-headed attention block is essentially the same as an attention block
  • However, the decoder must receive the information from the encoder
  • Otherwise, we wouldn't have learned anything from the encoder
  • Therefore, the masked multi-headed attention block receives:

    • Information from encoder: all of the words from the English sentence
    • Information from previous decoder layers: previous words in the translated sentence
  • Thus, the masked multi-headed attention block masks upcoming words in the translated sentence so that their attention weights become 0
  • Then, the attention network won't use them
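  • In practice, this masking is usually implemented by setting the scores of upcoming words to negative infinity before the softmax, so their attention weights come out as exactly 0; a minimal sketch with made-up Q, K, and V:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    L = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Mask upcoming words: -inf scores become 0 attention weights after softmax
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 3)) for _ in range(3))   # 4 translated words so far
Z, weights = masked_self_attention(Q, K, V)
print(np.round(weights, 2))   # upper triangle (upcoming words) is all zeros
```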

Training Multi-Headed Attention Blocks

  • Q, K, and V are trained and used in the encoder's multi-headed attention block
  • Then, K and V are passed to the decoder's masked multi-headed attention block

    • In the masked multi-headed attention block, its own individual Q is trained
  • Lastly, the same K and V are passed again to the decoder's second multi-headed attention block

    • The masked multi-headed attention block passes its Q to this multi-headed attention block
  • This is a crucial step in explaining how the representations of the two languages are mixed together
  • In summary, there are 2 Q matrices trained

    • One Q is trained and used in the encoder
    • Another Q is trained and used in the decoder
  • And, there is only 1 K and 1 V matrix trained

    • They are trained and used in both the encoder and decoder
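  • The sketch below (with made-up sizes and weights) shows how the decoder's second multi-headed attention block mixes the two languages: its Q comes from the decoder side, while K and V are derived from the encoder's output:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(decoder_state, encoder_output, W_Q, W_K, W_V):
    Q = decoder_state @ W_Q           # decoder's own Q (translated sentence so far)
    K = encoder_output @ W_K          # K and V come from the encoder (input sentence)
    V = encoder_output @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V    # one row per decoder word

rng = np.random.default_rng(0)
decoder_state = rng.normal(size=(3, 4))    # 3 German tokens generated so far
encoder_output = rng.normal(size=(6, 4))   # 6 English tokens from the encoder
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
print(encoder_decoder_attention(decoder_state, encoder_output, W_Q, W_K, W_V).shape)  # (3, 4)
```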

Illustrating Multi-Headed Attention after Training

  • After calculating each of the 8 heads, we can refer to the attention heads relative to individual words to determine what each head focuses on
  • In the following example, we encode the word it
  • Notice, one attention head focuses most on the animal
  • Another attention head focuses on tired
  • In a sense, the model's representation of the word it bakes in some of the representation of both animal and tired

[Figure: multiattention]

  • In the above example, notice how we're only focused on the 2^{nd} and 3^{rd} heads
  • By comparing particular heads with each other, we can get a decent idea about what other words each head focuses on relative to an individual word
  • Notice, if we focus on all of the heads, any analysis becomes less interpretable:

[Figure: multiattention]

Illustrating the Encoder and Decoder during Training

  • The input of the encoder is an input sequence
  • The output of the encoder is a set of attention vectors K and V
  • These 2 matrices are used by the decoder in its attention layers
  • This helps the decoder focus on appropriate places in the input sequence
  • Refer here for more detailed illustrations of the encoder and decoder

Applications of a Transformer Model

  • Automatic text summarization
  • Auto-completion
  • Named entity recognition (NER)
  • Question answering
  • Machine translation
  • Chat bots
  • Sentiment analysis
  • Market intelligence
  • Text classification
  • Character recognition
  • Spell checking

Popular Types of Transformers

  • GPT-3

    • Stands for Generative Pre-trained Transformer
    • Created by OpenAI in 2020
    • Used for text generation, summarization, and classification
  • T5

    • Stands for text-to-text transfer transformer
    • Created by Google in 2019
    • Used for question answering, text classification, etc.
  • BERT

    • Stands for Bidirectional Encoder Representations from Transformers
    • Created by Google in 2018
    • Used for creating text representations

TL;DR

  • The positional encoding vectors find patterns between positions of words in the sentences
  • The attention layers find patterns between words in the sentences
  • Normalization helps speed up training and processing time
  • Multi-headed attention layers only involve matrix multiplication operations
  • Multi-headed attention improves both performance and accuracy:

    • Accuracy: More heads imply more detectable patterns between words
    • Performance: These heads can be computed and trained in parallel using GPUs
  • Multi-headed attention scores are formatted as probabilities using the softmax function
  • The fully connected feed-forward layers in the encoders and decoders use ReLU activation functions
  • Roughly, Q represents embeddings of output words

    • Q learns patterns from our output sentence
  • Roughly, K and V represent embeddings of input words

    • K and V learn patterns from our input sentence
    • V represents how similar words from Q and K are
  • The encoder and decoder use the same K and V
  • The encoder and decoder each have their own Q
