History of BERT and Other NLP Models
- In the early stages of NLP, we simply wanted to predict a word in a sentence from the words surrounding it
- To do this, we used a continuous bag of words (CBOW) model
- This model is limited to predictions based on the input words within a fixed-length sliding window
- Unfortunately, this model excludes many useful context words and their relationships with other words in the sentence
- Then, ELMo was created in 2018 by researchers at the Allen Institute
- This model is a bidirectional LSTM
- Meaning, words to the left and right of a target word are considered
- This model is able to capture context words and their relationships more completely
- However, it still struggled to capture context in longer sentences
- Later in 2018, OpenAI introduced the GPT model
- There are several versions of GPT: GPT-1, GPT-2, and GPT-3
- All of these models are transformer models
- The GPT model only includes a decoder (no encoder included)
- This model only uses causal attention
- Unfortunately, each GPT model is only unidirectional
- Thus, we can't capture context both leftward and rightward of our target word in a sentence
- In late 2018, Google released the BERT model
- This model is a bidirectional transformer
- Meaning, words to the left and right of a target word are considered
- This model is able to entirely capture context words and relationships
- This model only includes an encoder (no decoder included)
- This model doesn't struggle to capture context in longer sentences
- This model can do the following tasks:
- Next sentence prediction
- Multi-mask language modeling
- In late 2019, T5 was introduced by Google
- This model is a bidirectional transformer
- Meaning, words to the left and right of a target word are considered
- This model is able to entirely capture context words and relationships
- This model includes both an encoder and a decoder
- This model doesn't struggle to capture context in longer sentences
- This model can do multi-task learning:
- This involves performing multiple different tasks with the same model
- Meaning, we can have a single model perform text classification and question answering based on a task label given in the user's input, as sketched below
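- The following is a minimal sketch of this idea, assuming the Hugging Face transformers library and the public t5-small checkpoint; the task is selected simply by prepending a task prefix to the input text

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load one multi-task model; the task prefix in the prompt selects the task.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",  # translation task
    "cola sentence: The car drive fast on the road.",        # grammatical-acceptability task
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```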
Defining Bidirectional Encoder Representations
- This model is also referred to as BERT
- As stated previously, it is a transformer model
- Specifically, it uses bidirectional attention for its attention mechanism
- It also uses positional embeddings to capture information about the positions of words in the sequence
- Since it has already been pre-trained by Google, we can use BERT for transfer learning
- At a high level, the architecture is designed around the following steps:
- Input word embeddings into BERT
- Pass the embeddings through the pre-trained encoder transformer blocks
- Receive output words as predictions
- The default BERT architecture (BERT-Base) has the following traits, which the sketch below confirms:
- 12 layers (12 transformer blocks)
- 12 attention heads
- Roughly 110 million parameters
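- A quick check of these traits, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint:

```python
from transformers import BertModel

# Load the default (BERT-Base) checkpoint and inspect its configuration.
model = BertModel.from_pretrained("bert-base-uncased")

print(model.config.num_hidden_layers)               # 12 transformer blocks
print(model.config.num_attention_heads)             # 12 attention heads
print(sum(p.numel() for p in model.parameters()))   # roughly 110 million parameters
```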
Applications of the BERT Network
- Google pre-trained BERT using the following tasks:
- Masked language modeling
- Next sentence prediction
- Thus, if we're only interested in using the pre-trained network, we can do the following:
- Masked language modeling
- Next sentence prediction
- However, if we're interested in fine-tuning the pre-trained network, we can also do the following:
- Text classification (e.g. sentiment analysis)
- Named-entity recognition (NER)
- Multi-genre natural language inference (MNLI)
- Question answering (SQuAD)
- Sentence paraphrasing
- Text summarization (e.g. article summaries)
- We can see fine-tuning BERT has the following benefits:
- Less time is required for training the model
- The model can be effectively adapted to a new task using a smaller dataset
- Only minimal task-specific adjustments are required to handle a wide variety of tasks
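- As an illustration of the two usage modes above (using the pre-trained network as-is vs. fine-tuning it), here is a minimal sketch assuming the Hugging Face transformers library; bert-base-uncased is the pre-trained network, and distilbert-base-uncased-finetuned-sst-2-english stands in for a checkpoint that has already been fine-tuned for sentiment analysis

```python
from transformers import pipeline

# Pre-trained only: masked language modeling works out of the box.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Fine-tuning BERT requires a [MASK] dataset."))

# Fine-tuned: text classification (sentiment analysis) needs a task-specific head
# trained on labeled examples; here we reuse a publicly fine-tuned checkpoint.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("BERT makes transfer learning easy."))
```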
Describing the Pre-Training Strategy for BERT
- BERT has been pre-trained on the following corpora:
- The BooksCorpus data set (a collection of unpublished books)
- Containing roughly 800 million words
- The English Wikipedia data set
- Containing roughly 2,500 million words
- Pre-training BERT is composed of two tasks:
- Masked language modeling
- This encodes bidirectional context for representing words
- Next sentence prediction
- This models the logical relationship between text pairs
- By pre-training using the tasks above, the model attains a general sense of the language
- When pre-training, masked language modeling involves the following (sketched in code after this list):
- Choose 15% of the tokens at random
- Mask them 80% of the time
- Replace them with a random token 10% of the time
- Keep the token as-is 10% of the time
- There can be multiple masked tokens in a sentence
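- The following is a toy sketch of that masking rule (not Google's actual implementation); it assumes whitespace-tokenized input and uses the literal string "[MASK]" as the mask token

```python
import random

def mask_tokens(tokens, vocab, select_rate=0.15):
    """Pick ~15% of tokens; mask 80% of them, swap 10% for a random token,
    and leave 10% unchanged. Returns the corrupted tokens and the targets."""
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < select_rate:
            targets[i] = token                       # model must recover the original token
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"              # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the token as-is
    return corrupted, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens, vocab=sorted(set(tokens))))
```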
Describing Input Representations for BERT
- The input of a BERT model depends on the task of interest
- Tasks like masked language modeling and next sentence prediction were used to pre-train BERT
- Thus, there isn't any fine-tuning necessary for running these tasks on our own inputs
- Tasks like sentiment analysis require some fine-tuning
- Regardless of task, the embeddings of the input sequence processed by BERT are the sum of the token embeddings, segment embeddings, and positional embeddings, as sketched below
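- Here is a minimal sketch of that embedding sum, using PyTorch and the dimensions of the default BERT-Base model; the token ids are illustrative placeholders

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768   # BERT-Base sizes

token_emb    = nn.Embedding(vocab_size, hidden)
segment_emb  = nn.Embedding(2, hidden)           # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)

token_ids    = torch.tensor([[101, 1996, 4937, 3727, 102]])  # illustrative ids for [CLS] ... [SEP]
segment_ids  = torch.zeros_like(token_ids)                   # every token belongs to sentence A
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)  # positions 0..4

# The input representation is the elementwise sum of the three embeddings.
input_embeddings = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(input_embeddings.shape)  # torch.Size([1, 5, 768])
```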
Defining Input Representations for BERT
- For tasks like masked language modeling and next sentence prediction:
- Training can run on the pre-trained model without fine-tuning
- The input is an array of tokenized words from two different sentences (see the sketch after this list):
- The array starts with a <cls> token
- The first sentence is separated from the second sentence using a <sep> token
- The array ends with a <sep> token
- For example: <cls> [tokens from sentence A] <sep> [tokens from sentence B] <sep>
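- Below is a minimal sketch of this sentence-pair format, assuming the Hugging Face transformers library (which writes the special tokens as [CLS] and [SEP]); the two example sentences are arbitrary

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("He bought a new car.", "The car is very fast.")

# Special tokens frame the two sentences: [CLS] sentence A [SEP] sentence B [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Segment ids mark which sentence each token belongs to: 0 for A, 1 for B
print(encoded["token_type_ids"])
```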
- For tasks like text classification and sentiment analysis:
- Training must run on the pre-trained model with fine-tuning
- The input is an array of tokenized words from one sentence and its corresponding sentiment (see the sketch after this list):
- The array starts with a <cls> token
- The sentence is separated from its sentiment using a <sep> token
- The array ends with a <sep> token
- For example: <cls> [tokens from our sentence] <sep> [sentiment] <sep>
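- A common way to set this up in practice is to let the <cls> representation feed a small classification head that is fine-tuned on labeled sentences; the following is a minimal sketch assuming the Hugging Face transformers library, with a made-up label convention (1 = positive, 0 = negative)

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("I love this movie", return_tensors="pt")
labels = torch.tensor([1])  # made-up convention: 1 = positive, 0 = negative

outputs = model(**inputs, labels=labels)
print(outputs.loss)    # classification loss minimized during fine-tuning
print(outputs.logits)  # unnormalized scores for each sentiment class
```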
Describing the Objective of BERT
- For the multi-mask language model, a cross-entropy loss function is used to predict the masked word or words
- For the next-sentence prediction model, a binary cross-entropy loss function is used to predict whether a given sentence should follow a target sentence
- Both of these loss outputs are added together to produce a final loss output, as sketched below
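- The following toy sketch (in PyTorch, with random made-up logits) shows how the two losses combine into the single pre-training objective

```python
import torch
import torch.nn.functional as F

vocab_size = 30522

mlm_logits  = torch.randn(3, vocab_size)        # predictions for 3 masked positions
mlm_targets = torch.tensor([2023, 2003, 1037])  # original token ids at those positions
nsp_logit   = torch.randn(1)                    # "is-next" score for the sentence pair
nsp_target  = torch.tensor([1.0])               # 1 = sentence B really follows sentence A

mlm_loss = F.cross_entropy(mlm_logits, mlm_targets)                   # masked-LM loss
nsp_loss = F.binary_cross_entropy_with_logits(nsp_logit, nsp_target)  # next-sentence loss
total_loss = mlm_loss + nsp_loss                                      # final pre-training loss
print(total_loss)
```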
Defining the GLUE Benchmark
- GLUE stands for General Language Understanding Evaluation
- The GLUE benchmark is one of the most popular benchmarks in NLP
- It is used to train, evaluate, and analyze models across a range of NLP tasks
- It is a collection of benchmark tools consisting of:
- A benchmark of nine different language comprehension tasks
- An ancillary data set
- A platform for evaluating and comparing the models
- It is used for various types of NLP tasks:
- Verifying whether a sentence is grammatical
- Verifying the accuracy of sentiment predictions
- Verifying the accuracy of paraphrasing text
- Verifying the similarity between two texts
- Verifying whether two questions are duplicates
- Verifying whether a question is answerable
- Verifying whether one sentence entails or contradicts another
- Usually, it is used with a leaderboard
- This is so people can see how well their model performs on a dataset compared to other models
- The GLUE benchmark has the following advantages:
- The GLUE benchmark is model-agnostic
- It doesn't matter if we're evaluating a transformer or an LSTM
- It makes use of transfer learning
- Most research uses the GLUE benchmark as a standard
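- As a small illustration, the nine GLUE tasks described above can be pulled directly with the Hugging Face datasets library (an assumption here, since these notes don't prescribe any tooling); "cola" is the task that checks whether a sentence is grammatical

```python
from datasets import load_dataset

# Load one of the nine GLUE tasks; CoLA labels sentences as grammatical or not.
cola = load_dataset("glue", "cola")
print(cola)                # train / validation / test splits
print(cola["train"][0])    # a single labeled example
```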