Duplicate Checking with Siamese Networks
Load in the data:
- Suppose our data consists of Quora questions
- Prepare the training set so each question is paired with a duplicate or non-duplicate:
| Question 1 | Question 2 | Is Duplicate? |
| --- | --- | --- |
| What is your age? | How old are you? | true |
| Where are you from? | Where are you going? | false |
| ... | ... | ... |
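A minimal sketch of loading such pairs with pandas. The file name `questions.csv` and the column names `question1`, `question2`, `is_duplicate` are assumptions modeled on the public Quora question-pairs CSV; for triplet-loss training, typically only the duplicate pairs are kept (negatives come from the other questions in a batch):

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset
data = pd.read_csv('questions.csv')
duplicates = data[data['is_duplicate'] == 1]       # keep only known-duplicate pairs
q1_train = duplicates['question1'].tolist()
q2_train = duplicates['question2'].tolist()        # q1_train[i] and q2_train[i] are duplicates
```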
Gather questions into batches:
- Each question in one batch should be paired with its duplicate at the same position in the other batch
- Note, there shouldn't be any duplicate questions within a batch
- Meaning, each question in an individual batch should be unique
- The following table illustrates these pairings:
| Batch 1 | Batch 2 |
| --- | --- |
| What is your age? | How old are you? |
| Where are you from? | Where were you born? |
| ... | ... |
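A minimal sketch of a batch generator, assuming `q1_train[i]` and `q2_train[i]` are known duplicates (as in the loading sketch above) and that each question appears in only one pair, so the questions within a batch are unique:

```python
import random

def batch_generator(q1_list, q2_list, batch_size):
    """Yield (batch1, batch2) where batch1[i] and batch2[i] are duplicate questions."""
    indices = list(range(len(q1_list)))
    while True:
        random.shuffle(indices)
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            chosen = indices[start:start + batch_size]
            yield ([q1_list[i] for i in chosen],
                   [q2_list[i] for i in chosen])
```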
Preprocess questions
- Maybe convert uppercase to lowercase
- Maybe remove symbols
- Maybe remove numbers
- Maybe remove stop words
- Maybe remove usernames
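A minimal sketch of such a preprocessing step; the regular expressions and the tiny stop-word set are illustrative choices, not a fixed recipe:

```python
import re

STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'of'}    # illustrative subset only

def preprocess(question):
    question = question.lower()                        # lowercase
    question = re.sub(r'@\w+', '', question)           # remove usernames
    question = re.sub(r'\d+', '', question)            # remove numbers
    question = re.sub(r'[^a-z\s]', '', question)       # remove symbols
    return [w for w in question.split() if w not in STOP_WORDS]  # remove stop words
```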
Build a vocabulary:
- Parse each question by adding unique words to the vocabulary
- Again, preprocess the questions before building the vocabulary
- The vocabulary is built on training data
- Many general-purpose vocabularies have already been built for us
- These pre-built vocabularies may work for our specific use case
- However, we may prefer to build our own if the project requires a more specialized vocabulary (an example of the pre-built route follows below)
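As an example of the pre-built route, Trax can tokenize text with an existing subword vocabulary file such as `en_8k.subword` (the same file used in the implementation below). A minimal sketch, assuming that vocabulary file is reachable in Trax's default vocabulary directory:

```python
import trax

# Tokenize one question with a pre-built subword vocabulary instead of our own dict
token_ids = next(trax.data.tokenize(iter(["How old are you?"]),
                                    vocab_file='en_8k.subword'))
```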
Convert preprocessed questions to tensors:
- After preprocessing, we tokenize each question and convert it to a tensor of integers
- These tensors are padded to the length of the longest question within the batch
- The tensor values are each word's index in the vocabulary (out-of-vocabulary words map to a special __UNK__ index)
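A minimal sketch of this conversion, assuming the questions are already preprocessed into word lists and the vocabulary reserves `__PAD__` and `__UNK__` entries as in the implementation below:

```python
import numpy as np

def questions_to_tensor(questions, vocab):
    """Map each word to its vocabulary index and pad to the longest question in the batch."""
    encoded = [[vocab.get(word, vocab['__UNK__']) for word in question]
               for question in questions]
    max_len = max(len(q) for q in encoded)
    return np.array([q + [vocab['__PAD__']] * (max_len - len(q)) for q in encoded])
```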
Build supervised model with embedding layer
A typical duplicate checking system includes an initial word embedding layer
- An embedding layer is a trainable layer with weights
- Embedding layers map the discrete token indices representing a question to smaller, dense vectors
- The values in a word embedding represent contextual features of the word, learned during training
- These values are the weights of the embedding layer
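Conceptually, an embedding layer is a trainable lookup table: a `(vocab_size, d_feature)` weight matrix indexed by token id. A minimal NumPy sketch with illustrative sizes:

```python
import numpy as np

vocab_size, d_feature = 8192, 256
embedding_weights = np.random.randn(vocab_size, d_feature) * 0.01  # trainable weights
token_ids = np.array([3, 17, 42])          # a tokenized question
embedded = embedding_weights[token_ids]    # shape (3, 256): one dense vector per token
```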
The output value can be:
- A normalized (unit-length) vector from the LSTM network, e.g. the LSTM outputs averaged over the sentence length and then L2-normalized
Train the model
Train the model using:
- Optimizers like Adam
- Batching techniques like bucketing
- etc.
- Pass both questions of a pair through the same LSTM model (the two branches share weights)
- Compute a cosine similarity score on the normalized outputs of the two questions
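Because the two outputs are L2-normalized, their cosine similarity reduces to a dot product; a minimal sketch with made-up vectors:

```python
import numpy as np

def normalize(v):
    return v / np.sqrt(np.sum(v * v, axis=-1, keepdims=True))

v1 = normalize(np.array([0.3, 1.2, -0.7]))   # model output for question 1 (made-up values)
v2 = normalize(np.array([0.4, 1.0, -0.5]))   # model output for question 2
cosine_similarity = np.dot(v1, v2)           # close to 1.0 for likely duplicates
```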
Test the model
- Convert each input into an array of numbers
- Feed the inputs into our model
- Compare the two outputs using cosine similarity
- Compare the similarity score against a threshold to decide whether the questions are duplicates (see the predict() function in the implementation below)
General Implementation of Duplicate Checking
# 0. Imports (Trax provides the layers, data pipeline, training loop, and fastmath numpy backend)
from functools import partial
import nltk
import trax
from trax import layers as tl
from trax.fastmath import numpy as fastnp
from trax.supervised import training

# 1. Split data into training and test sets
# (load_questions, pair_questions, preprocess_questions, process_question, and
#  data_generator are helper functions assumed to be defined elsewhere)
train_x, train_y, test_x, test_y = load_questions()
# 2. Pair questions into duplicates
train_x, train_y, test_x, test_y = pair_questions(train_x, train_y, test_x, test_y)
# 3. Preprocess data
train_x, test_x = preprocess_questions(train_x, test_x)
# 4. Create vocabulary from the training data (reserve ids for padding, end-of-sentence, and unknown tokens)
vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}
for question in train_x:
    processed_question = process_question(question)
    for word in processed_question:
        if word not in vocab:
            vocab[word] = len(vocab)
# 5a. Initialize the tokenization/batching pipeline for questions
data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_file='en_8k.subword', keys=[0]),
    trax.data.Shuffle(),
    trax.data.FilterByLength(max_length=30, length_keys=[0]),
    # batch_sizes needs one more entry than boundaries (for examples longer than the last boundary)
    trax.data.BucketByLength(boundaries=[5, 10, 30], batch_sizes=[32, 16, 8, 4], length_keys=[0]),
    trax.data.AddLossWeights()
)
# 5b. Tokenize and batch the questions, then output as a generator
streamed_batches = data_pipeline(train_x)
# 6a. Initialize normalization function for the output of the LSTM
def normalize(x):  # normalizes the vectors to have L2 norm 1
    return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))
# 6b. Initialize the question encoder
q_processor = tl.Serial(
    tl.Embedding(vocab_size=8192, d_feature=256),  # Embedding layer
    tl.LSTM(n_units=256),                          # LSTM layer
    tl.Mean(axis=1),                               # Average over axis 1 (sentence length)
    tl.Fn('Normalize', lambda x: normalize(x))     # Apply L2 normalization
)
# 6c. Siamese model: both questions pass through the same encoder (shared weights)
model = tl.Parallel(q_processor, q_processor)
# 7a. Initialize triplet loss layer
def TripletLoss(margin=0.25):
    triplet_loss_fn = partial(TripletLossFn, margin=margin)
    return tl.Fn('TripletLoss', triplet_loss_fn)
# 7b. Triplet loss on the pairwise cosine similarities of the two batches
def TripletLossFn(v1, v2, margin=0.25):
    scores = fastnp.dot(v1, v2.T)  # pairwise cosine similarity (vectors are L2-normalized)
    batch_size = len(scores)
    positive = fastnp.diagonal(scores)  # similarity of each question to its true duplicate
    negative_without_positive = scores - fastnp.eye(batch_size) * 2.0  # mask out the diagonal
    closest_negative = negative_without_positive.max(axis=1)  # hardest negative per row
    negative_zero_on_duplicate = (1.0 - fastnp.eye(batch_size)) * scores
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)
    triplet_loss1 = fastnp.maximum(margin - positive + closest_negative, 0)
    triplet_loss2 = fastnp.maximum(margin - positive + mean_negative, 0)
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
    return triplet_loss
# 7c. Initialize training task with the triplet loss defined above
train_task = training.TrainTask(
    labeled_data=streamed_batches,
    loss_layer=TripletLoss(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=500,
)
# 7d. Initialize evaluation task (eval batches come from the same pipeline applied to test data)
streamed_eval_batches = data_pipeline(test_x)
eval_task = training.EvalTask(
    labeled_data=streamed_eval_batches,
    metrics=[TripletLoss()],
    n_eval_batches=20  # For less variance in eval numbers
)
# 7e. Prepare training loop, saving checkpoints to output_dir
training_loop = training.Loop(
    model,
    train_task,
    eval_tasks=[eval_task],
    output_dir='~/model/'
)
# 7f. Run 2000 steps (batches)
training_loop.run(2000)
# 8a. Initialize prediction function for duplicates
def predict(question1, question2, threshold, model, vocab, data_generator, verbose=False):
    q1 = nltk.word_tokenize(question1)  # tokenize
    q2 = nltk.word_tokenize(question2)  # tokenize
    Q1, Q2 = [], []
    for word in q1:  # encode q1, mapping out-of-vocabulary words to __UNK__
        Q1 += [vocab.get(word, vocab['__UNK__'])]
    for word in q2:  # encode q2
        Q2 += [vocab.get(word, vocab['__UNK__'])]
    # Call the data generator (assumed defined elsewhere) using next() to pad and batch the pair
    Q1, Q2 = next(data_generator([Q1], [Q2], 1, vocab['__PAD__']))
    v1, v2 = model((Q1, Q2))
    # compute cosine similarity of the two normalized output vectors
    d = fastnp.dot(v1[0], v2[0].T)
    # is d greater than the threshold?
    res = d > threshold
    if verbose:
        print("Q1 = ", Q1, "\nQ2 = ", Q2)
        print("d = ", d)
        print("res = ", res)
    return res
# 8b. Predict duplicates on test data (reusing the model trained above)
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# True means the questions are predicted to be duplicates, False otherwise
predict(question1, question2, 0.7, model, vocab, data_generator, verbose=True)