Data Pipelines using Trax

Motivating Padding in Neural Networks

  • When training NLP neural networks, training sets are typically split into batches
  • All observations within a batch must be the same length

    • However, observations in different batches can be different lengths
  • To achieve this, zeroes are appended to the end of each tensor until it matches the length of the longest tensor within its batch

    • Here, a tensor represents an encoded tweet, sentence, or any other sequence of words (a minimal padding sketch follows below)
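  • The sketch below (plain NumPy, reusing the example token IDs that appear later in these notes) illustrates how trailing zeroes pad every tensor in a batch to the longest length
import numpy as np

# Three tokenized tweets of different lengths
batch = [np.array([121, 270]),        # good morning
         np.array([41, 3]),           # hello world
         np.array([740, 12, 601])]    # please stop that

# Append zeroes so every tensor matches the longest tensor in the batch
max_len = max(len(t) for t in batch)
padded = np.stack([np.pad(t, (0, max_len - len(t))) for t in batch])
print(padded)
# [[121 270   0]
#  [ 41   3   0]
#  [740  12 601]]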

Motivating Bucketing in Neural Networks

  • When training on batches with padded tensors, the zeroes are ignored when gradients are computed
  • However, forward propagating over tensors with many zeroes in an RNN can slow down learning

    • Sometimes, we can run into vanishing gradients
  • In other words, training on padded tensors can negatively impact accuracy and performance
  • Therefore, our goal is to minimize zero padding

Minimizing Zero Padding with Bucketing

  • Normally, batches are initialized by grouping observations at random
  • As a result, we end up grouping together tensors of very different lengths
  • This isn't an efficient approach for minimizing zero padding
  • Instead, we can use a technique called bucketing to minimize zero padding

Describing Bucketing in Detail

  • Bucketing is performed on a group of batches
  • Specifically, each batch is associated with a bucket
  • Each bucket represents a range of lengths of tensors
  • A training set using batching and bucketing works in the following way:

    • Each batch is assigned a batch size and bucket dimension
    • Each training tensor is assigned to a suitable batch
  • Specifically, a training tensor is assigned to a batch whose bucket dimension covers the tensor's length (see the sketch below)
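  • As a rough plain-Python sketch of the assignment rule (the boundaries and batch sizes here are illustration values, not the Trax implementation):
# Each tensor is routed to the first bucket whose boundary covers its length
boundaries = [5, 10]           # bucket 0 holds lengths 1-5, bucket 1 holds lengths 6-10
batch_sizes = [3, 1]           # tensors needed to complete a batch in each bucket

def bucket_index(length, boundaries):
    for i, boundary in enumerate(boundaries):
        if length <= boundary:
            return i
    return len(boundaries)     # overflow bucket for lengths above the last boundary

lengths = [2, 2, 3, 2, 10]     # lengths of the example tweet tensors used later
print([bucket_index(n, boundaries) for n in lengths])
# [0, 0, 0, 0, 1] -> the four short tweets share bucket 0; the long tweet goes to bucket 1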

Defining Bucketing Dimensions and Batch Size

  • A bucket dimension refers to a range of suitable tensor lengths
  • A batch size refers to the number of tensors that must accumulate in a batch before it is used for training
  • The number of buckets and the length range covered by each bucket are adjustable hyperparameters
  • The batch size associated with each bucket is also an adjustable hyperparameter

Sample Data Pipeline in Trax

  • Before training our neural network, we typically need to preprocess and tokenize the raw training set into numerical tensors
  • In particular, data pipelines are used for the following general tasks:

    • Tokenizing and encoding raw data to tensors
    • Shuffling tensors
    • Filtering out tensors that don't satisfy a length requirement
    • Padding tensors
    • Bucketing training tensors into batches
# tweets.txt
#
# good morning, 1
# hello world, 1
# please stop that, 0
# thank you, 1
# a b c d e f g h i j, 0
#

import numpy as np
import trax

# Read in tuples of (tweet text, sentiment label)
with open('tweets.txt') as f:
    tweets = iter([(line.split(',')[0], np.int64(line.split(',')[1].strip()))
                   for line in f if line.strip()])

# Initialize data pipeline for preprocessing raw data
data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_file='en_8k.subword', keys=[0]),   # text -> subword token IDs
    trax.data.Shuffle(),                                        # shuffle the stream
    trax.data.FilterByLength(max_length=30, length_keys=[0]),   # drop overly long tweets
    trax.data.BucketByLength(boundaries=[5, 10, 30],            # bucket by tensor length
                             batch_sizes=[1, 5, 10, 1],         # one extra entry for lengths above 30
                             length_keys=[0]),
    trax.data.AddLossWeights()                                  # add per-example loss weights
    )

# Run pipeline
train_batches_stream = data_pipeline(tweets)
example_batch = next(train_batches_stream)
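
  • Continuing from the snippet above, each element yielded by the pipeline is one batch; a quick way to inspect it:
# AddLossWeights turns (input, target) pairs into (inputs, targets, weights) triples
inputs, targets, weights = example_batch
print(inputs.shape)    # (batch_size, padded_length) of token IDs
print(targets)         # sentiment labels for each tweet in the batch
print(weights)         # per-example loss weights (all ones by default)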

Describing Tokenize in Trax

  • In Trax, the Tokenize class handles the encoding step of the pipeline
  • Specifically, tokenization only involves encoding sequences of words (e.g. sentences or tweets) as tensors of integer token IDs
  • The Tokenize class doesn't perform any text preprocessing
  • As a result, any of the following preprocessing steps need to be run separately before tokenization:

    • Lowercasing words
    • Removing stop words
    • Removing symbols
  • The Tokenize class assigns each word (or subword) an integer ID based on its index within a given vocabulary

    • Again, it doesn't remove any capitalization, stop words, etc.
    • The Trax library provides a comprehensive vocabulary called en_8k.subword
    • This vocabulary extracts subwords from words and assigns integer IDs to those subwords
    • It also accounts for both lowercase and uppercase words
    • A custom or personalized vocabulary can also be used with the Tokenize class
  • The following table includes:

    • Raw tweets
    • Those tweets matched in the en_8k.subword vocab
    • Final tokenized tweets after running the Tokenize class
Tweet              Matched with Vocab           Tokenized Tensor
the good evening   [the, good, eve, ning]       [205, 2730, 4750, 1264]
the Good Evening   [the, Goo, d, Even, ing]     [205, 5901, 20, 3207, 561]
the good Evening   [the, good, Even, ing]       [205, 2730, 3207, 561]
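  • The following is a minimal sketch of standalone tokenization using trax.data.tokenize and trax.data.detokenize; the vocab_dir value is an assumption, so point it at wherever en_8k.subword is stored
import trax

# Tokenize a single tweet into subword IDs
token_ids = next(trax.data.tokenize(
    iter(['the good evening']),
    vocab_file='en_8k.subword',
    vocab_dir='gs://trax-ml/vocabs/'))   # assumed vocabulary location
print(token_ids)   # e.g. [205, 2730, 4750, 1264], per the table above

# Convert the IDs back into text
print(trax.data.detokenize(token_ids,
                           vocab_file='en_8k.subword',
                           vocab_dir='gs://trax-ml/vocabs/'))   # 'the good evening'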

Describing FilterByLength in Trax

  • FilterByLength filters out observations that are longer than a given maximum length
  • Specifically, only tweets whose length is no greater than the maximum value are kept
  • In the pipeline above, FilterByLength runs after Tokenize, so the length is measured on the tokenized tensor; the character counts below are used purely for illustration
  • The following table includes the input and output of a FilterByLength call with max_length=10, followed by a small sketch of the filtering rule:
Raw Tweet          Length (characters)   Kept or Filtered
hello world        11                    filtered out
goodbye            7                     kept
hi there           8                     kept
goodnight          9                     kept
good afternoon     14                    filtered out
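  • As a plain-Python sketch of the same filtering rule (not the Trax implementation; lengths here are character counts, matching the table):
max_length = 10
tweets = ['hello world', 'goodbye', 'hi there', 'goodnight', 'good afternoon']

# Keep only tweets whose length does not exceed max_length
print([t for t in tweets if len(t) <= max_length])
# ['goodbye', 'hi there', 'goodnight']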

Describing BucketByLength in Trax

  • The BucketByLength class achieves the following:

    1. First, tensors are bucketed into batches

      • For each batch, tensors have a length within the same range
      • In this scenario, a tensor represents a tokenized tweet
    2. Second, tensors are padded to match the length of the largest tensor within its batch

      • Thus, all of the tensors in a batch are the same length (based on a bucket dimension)
# Sample implementation for the example below
trax.data.BucketByLength(
    boundaries=[5, 10],        # bucket 1 covers lengths 1-5, bucket 2 covers lengths 6-10
    batch_sizes=[3, 1, 1],     # one extra entry for lengths above the last boundary
    length_keys=[0]            # measure length on the tweet tensor (element 0)
    )
Raw Tweet              Sentiment
good morning           1
hello world            1
please stop that       0
thank you              1
a b c d e f g h i j    0

↓ (after tokenization)

Tokenized Tensors
[121, 270]
[41, 3]
[740, 12, 601]
[4, 360]
[17, 2, 111, 731, 6, 98, 99, 119, 555, 90]

↓ (after bucketing and padding)

Batch Index   Batch                                           Bucket Dimension   Batch Size   Is Padded   Is Filtered
1             [[121, 270, 0], [41, 3, 0], [740, 12, 601]]     1-5                3            Yes         No
2             [[4, 360]]                                      1-5                3            No          Yes
3             [[17, 2, 111, 731, 6, 98, 99, 119, 555, 90]]    6-10               1            No          No
  • Notice, each batch only contains tensors that are no larger than its associated bucket dimension
  • Notice, the tensors within a batch may be smaller than its associated bucket dimension

    • This can happen if none of the tensors reach the exact bucket dimension
  • Notice, a batch is considered incomplete if it never accumulates the required number of tensors (its batch size)

    • As a result, its tensors are removed from training entirely
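  • The following runnable sketch applies BucketByLength to the example tensors above; the token IDs come from the table, and itertools.cycle is used because a bucket only emits a batch once it fills up
import itertools
import numpy as np
import trax

# (tokenized tweet, sentiment) pairs from the example above
examples = [
    (np.array([121, 270]), np.int64(1)),                                   # good morning
    (np.array([41, 3]), np.int64(1)),                                      # hello world
    (np.array([740, 12, 601]), np.int64(0)),                               # please stop that
    (np.array([4, 360]), np.int64(1)),                                     # thank you
    (np.array([17, 2, 111, 731, 6, 98, 99, 119, 555, 90]), np.int64(0)),   # a b c d e f g h i j
]

stream = trax.data.BucketByLength(
    boundaries=[5, 10],
    batch_sizes=[3, 1, 1],     # one extra entry covers lengths above the last boundary
    length_keys=[0])(itertools.cycle(examples))

batch_inputs, batch_targets = next(stream)
print(batch_inputs.shape)      # the 3 short tweets, padded to a common length
print(batch_targets)           # [1 1 0]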
