Motivating Sentiment Analysis
- Let's say we're interested in classifying tweets as either positive or negative
- To do this, the following types of models work well:
  - Probabilistic Models
    - Logistic Regression
    - Naive Bayes
  - Sequence Models
    - Recurrent Neural Network
- Sentiment analysis typically involves the following steps:
  - Feature preprocessing
  - Building a vocabulary
  - Modeling sentiment based on given words
Defining Pre-Modeling Steps
- Before defining any typical models used in sentiment analysis, we'll define the steps that are necessary in the setup
- Specifically, we'll dive into the details for each of the following:
  - Feature preprocessing
  - Building a vocabulary
Step 1: Feature Preprocessing
- The first step in sentiment analysis is feature preprocessing
- This needs to be done on both the training and testing data
- The most important preprocessing rules include:
  - Removing stop words
  - Formatting uppercase words as lowercase words
  - Removing punctuation
  - Handling URLs
- We should be careful, since the above formatting changes can also remove a degree of context, such as the following:
  - Removing punctuation can remove a degree of context
    - Emoticons
    - URLs
  - Removing words can remove a degree of context for neighboring words
    - not good
    - do not want
  - Ignoring word order can remove a degree of context
    - I am happy because I did not go
    - I am not happy because I did go
  - Adversarial attacks
    - Sarcasm
    - Irony
    - Euphemisms
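The preprocessing rules above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `preprocess` function, the URL pattern, and the tiny `STOP_WORDS` set are all assumptions made for the example. It also shows the caveat about lost context, since removing stop words turns "not good" into "good".

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"i", "am", "the", "a", "is", "not", "do", "because", "did"}

# Simple assumption about what a URL looks like in a tweet.
URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(tweet, remove_stop_words=True):
    """Lowercase a tweet, drop URLs and punctuation, optionally drop stop words."""
    text = tweet.lower()
    text = URL_PATTERN.sub("", text)
    # Removing punctuation also removes emoticons such as ":(".
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


# With stop words removed, the negation disappears.
print(preprocess("This movie is NOT good :( http://example.com"))
# Keeping stop words preserves "not".
print(preprocess("This movie is NOT good :(", remove_stop_words=False))
```

Whether to keep negations such as "not" in the token list is a design choice that depends on the model: a bag-of-words model loses their position anyway, while a sequence model can use them.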
Step 2: Building a Vocabulary
- The next step in sentiment analysis is building a vocabulary
- Meaning, we create a unique set of words from our training data
- For example, say we have three tweets in our training dataset:
- Then, our vocabulary becomes the following:
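As a rough illustration of this step, the sketch below builds a vocabulary from three hypothetical tweets (they are assumptions for the example, not necessarily the tweets the notes had in mind). The `build_vocab` helper here is likewise only one possible way to write it.

```python
def build_vocab(tweets):
    """Map each unique word to an integer index, in order of first appearance."""
    vocab = {}
    for tweet in tweets:
        for word in tweet.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


# Hypothetical training tweets.
train_tweets = [
    "I am happy because I am learning NLP",
    "I am happy",
    "I am sad",
]

vocab = build_vocab(train_tweets)
print(vocab)
```

Each word appears once in the vocabulary no matter how often it occurs in the tweets.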
Step 3: Modeling
- The last step in sentiment analysis is modeling
- As stated previously, we have a few options for modeling sentiment:
  - Logistic Regression
  - Probabilistic Modeling: Naive Bayes
  - Sequence Modeling: Recurrent Neural Networks
- Typically, logistic regression has the following properties:
  - None of the context of an input sequence is captured
  - Training is relatively fast
- Typically, naive Bayes has the following properties:
  - Only some of the context of an input sequence is captured
  - Training is also relatively fast, since it mostly amounts to counting n-grams
- Typically, RNNs have the following properties:
  - The entire context of an input sequence is captured
  - Training is the slowest
Sentiment Analysis using Logistic Regression
- One way of building a model for sentiment classification is to create an individual parameter for each word in our vocabulary
- The following model reflects this approach for a vocabulary $V$:

$$\hat{y} = \sigma\left(\theta_0 + \sum_{i=1}^{|V|} \theta_i x_i\right)$$

- Where the variables represent the following:
  - $x_i$ is a binary value, where $x_i = 1$ if word $i$ is included in the tweet
  - $\theta_i$ is a weight for $x_i$
- This approach has many flaws, since there are too many parameters to learn
- Also, the data will become extremely sparse, indicating we should try a different approach
- Specifically, a more computationally efficient approach involves creating a model with only two features (plus a bias term):

$$\hat{y} = \sigma\left(\theta_0 + \theta_{pos} x_{pos} + \theta_{neg} x_{neg}\right)$$

- Where the variables represent the following:
  - $x_{pos}$ is a whole number equaling the total frequency with which the tweet's words appear in positive tweets
  - $x_{neg}$ is a whole number equaling the total frequency with which the tweet's words appear in negative tweets
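To make the two frequency-based features concrete, here is a minimal sketch. The helper names (`build_freqs`, `extract_features`, `predict`), the two training tweets, and the hand-picked weights in `theta` are all assumptions for illustration; in practice the weights would be learned by gradient descent.

```python
import math
from collections import Counter


def build_freqs(tweets, labels):
    """Count how often each word appears in positive (1) and negative (0) tweets."""
    freqs = Counter()
    for tweet, label in zip(tweets, labels):
        for word in tweet.lower().split():
            freqs[(word, label)] += 1
    return freqs


def extract_features(tweet, freqs):
    """Return [bias, positive frequency sum, negative frequency sum] for one tweet."""
    words = set(tweet.lower().split())  # count each unique word once
    pos_sum = sum(freqs[(w, 1)] for w in words)
    neg_sum = sum(freqs[(w, 0)] for w in words)
    return [1.0, float(pos_sum), float(neg_sum)]


def predict(features, theta):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = sum(t * x for t, x in zip(theta, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical training data.
tweets = ["I am happy", "I am sad"]
labels = [1, 0]
freqs = build_freqs(tweets, labels)

features = extract_features("I am happy", freqs)
print(features)

# Hypothetical, hand-picked weights (not trained).
theta = [0.0, 0.5, -0.5]
print(predict(features, theta))
```

Every tweet is reduced to three numbers regardless of vocabulary size, which is what makes this representation so much cheaper than one weight per word.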
General Implementation using Logistic Regression
# Format tweets by removing stop words, punctuation, etc.
clean_tweets = format_tweets(tweets)
# Build vocabulary on training data
vocab = build_vocab(clean_tweets)
# Build dictionary of frequencies
freqs = build_freqs(clean_tweets, vocab)
# Use logistic regression to classify sentiment
weights = train_log_reg(freqs, labels)
# Preprocess test data
clean_new_tweet = format_tweets(new_tweet)
# Test trained model
sentiment = test_model(clean_new_tweet, weights)
Sentiment Analysis using Naive Bayes
- Determine the size of our n-gram:
  - For this example, let's assign the size of our n-gram to be $n = 2$ (bigrams)
- Determine the sentiment for each sequence of two words in our training dataset:
- Calculate the frequencies of the n-grams in our training dataset:
N-gram | Positive | Negative |
---|---|---|
i am | 2 | 1 |
am happy | 1 | 0 |
happy because | 1 | 0 |
because i | 1 | 0 |
am learning | 1 | 0 |
learning nlp | 1 | 0 |
am sad | 0 | 1 |
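- Convert the frequencies into conditional probabilities by dividing each count by its class total (7 positive n-grams and 2 negative n-grams):
N-gram | $P(w \mid \text{pos})$ | $P(w \mid \text{neg})$ |
---|---|---|
i am | 2/7 | 1/2 |
am happy | 1/7 | 0 |
happy because | 1/7 | 0 |
because i | 1/7 | 0 |
am learning | 1/7 | 0 |
learning nlp | 1/7 | 0 |
am sad | 0 | 1/2 |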
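- Perform Laplacian smoothing to avoid zero probabilities, by adding 1 to each count and the number of unique n-grams (7) to each class total:
N-gram | $P(w \mid \text{pos})$ | $P(w \mid \text{neg})$ |
---|---|---|
i am | 3/14 | 2/9 |
am happy | 2/14 | 1/9 |
happy because | 2/14 | 1/9 |
because i | 2/14 | 1/9 |
am learning | 2/14 | 1/9 |
learning nlp | 2/14 | 1/9 |
am sad | 1/14 | 2/9 |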
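The two conversions can be checked with a short sketch. It assumes the common add-one formulation of Laplacian smoothing, $(\text{count} + 1) / (N_{class} + V)$, where $V$ is the number of unique n-grams; the `conditional_probs` function is a hypothetical helper, and the counts are copied from the frequency table above.

```python
from fractions import Fraction

# Bigram counts copied from the frequency table: (positive, negative).
freqs = {
    "i am": (2, 1),
    "am happy": (1, 0),
    "happy because": (1, 0),
    "because i": (1, 0),
    "am learning": (1, 0),
    "learning nlp": (1, 0),
    "am sad": (0, 1),
}


def conditional_probs(freqs, smooth=False):
    """Return {n-gram: (P(n-gram | pos), P(n-gram | neg))} as exact fractions."""
    n_pos = sum(pos for pos, _ in freqs.values())
    n_neg = sum(neg for _, neg in freqs.values())
    k = 1 if smooth else 0           # add-one smoothing constant
    v = len(freqs) if smooth else 0  # number of unique n-grams
    return {
        gram: (Fraction(pos + k, n_pos + v), Fraction(neg + k, n_neg + v))
        for gram, (pos, neg) in freqs.items()
    }


raw = conditional_probs(freqs)
smoothed = conditional_probs(freqs, smooth=True)

print(raw["am sad"])       # the positive probability is zero before smoothing
print(smoothed["am sad"])  # both probabilities are non-zero after smoothing
```

Using `Fraction` keeps the values exact, which makes it easy to confirm that each column still sums to 1 after smoothing.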
- Compute naive Bayes for testing any new sentence:
  - Note, new n-grams are simply excluded from the calculation
- Calculate the log likelihood of the final output with a prior term:
  - Note, the prior term refers to $\frac{P(\text{pos})}{P(\text{neg})}$
  - Working with logs improves numerical stability
  - Specifically, taking the product of many small decimals runs the risk of underflow
  - Meaning, the product may become too small to be represented as a floating-point number
- Classify a given sentence by multiplying the probabilities of the n-grams found within the sentence (or, equivalently, by summing their logs):
- Make sure the dataset is balanced
  - Meaning, the log of the prior term should equal $0$
  - In this example, our dataset isn't well-balanced
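A minimal sketch of the scoring step, under stated assumptions: the bigram counts come from the frequency table earlier in this section, add-one smoothing is used, and the class sizes (2 positive tweets, 1 negative tweet) are hypothetical values chosen to mimic an unbalanced dataset. The helper names are illustrative only.

```python
import math

# Bigram counts from the frequency table: (positive, negative).
freqs = {
    "i am": (2, 1),
    "am happy": (1, 0),
    "happy because": (1, 0),
    "because i": (1, 0),
    "am learning": (1, 0),
    "learning nlp": (1, 0),
    "am sad": (0, 1),
}

# Hypothetical number of tweets per class (an unbalanced dataset).
N_POS_TWEETS, N_NEG_TWEETS = 2, 1


def train_log_ratios(freqs):
    """Return log(P(gram | pos) / P(gram | neg)) with add-one smoothing."""
    n_pos = sum(pos for pos, _ in freqs.values())
    n_neg = sum(neg for _, neg in freqs.values())
    v = len(freqs)
    return {
        gram: math.log((pos + 1) / (n_pos + v)) - math.log((neg + 1) / (n_neg + v))
        for gram, (pos, neg) in freqs.items()
    }


def bigrams(sentence):
    """Split a sentence into lowercase bigrams."""
    words = sentence.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def score(sentence, log_ratios, log_prior):
    """Log prior plus the summed log ratios; unseen bigrams are skipped."""
    return log_prior + sum(log_ratios[g] for g in bigrams(sentence) if g in log_ratios)


log_ratios = train_log_ratios(freqs)
log_prior = math.log(N_POS_TWEETS / N_NEG_TWEETS)  # equals 0 only when balanced

# A score above 0 is read as positive, below 0 as negative.
print(score("I am happy", log_ratios, log_prior) > 0)
print(score("I am sad", log_ratios, log_prior) > 0)

# Why logs: a long product of small probabilities underflows to 0.0.
print(math.prod([1e-5] * 100))
```

Summing logs instead of multiplying probabilities gives the same decision while keeping the intermediate values in a range that floating-point numbers can represent.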
General Implementation of Naive Bayes
# Format tweets by removing stop words, punctuation, etc.
clean_tweets = format_tweets(tweets)
# Build vocabulary on training data
vocab = build_vocab(clean_tweets)
# Build dictionary of frequencies
freqs = build_freqs(clean_tweets, vocab)
# Apply laplacian smoothing to avoid 0s
smooth_freqs = lap_smooth(freqs)
# See if dataset is balanced
is_balanced = log_prior(smooth_freqs) == 0
# Use naive bayes to classify sentiment
weights = train_nb(smooth_freqs, labels)
# Preprocess test data
clean_new_tweet = format_tweets(new_tweet)
# Test trained model
sentiment = test_model(clean_new_tweet, weights)
Sentiment Analysis using Recurrent Neural Networks
- Load in the data:
  - Let's say we have a set of positive tweets and a set of negative tweets
  - Suppose our training set contains a fixed share of the total tweets and our test set contains the remainder
  - We need to make sure the positive and negative tweets are evenly distributed across both training and test sets
- Preprocess tweets
  - Maybe convert uppercase to lowercase
  - Maybe remove symbols
  - Maybe remove numbers
  - Maybe remove stop words
  - Maybe remove usernames
- Build a vocabulary:
  - Parse each tweet by adding unique words to the vocabulary
  - Again, preprocess the tweet before building a vocabulary
  - The vocabulary is built on training data
  - There are many broad vocabularies already built for us
  - These pre-built vocabularies may work for our specific use-case
  - However, we may be more interested in building our own if our project requires a more individualized vocabulary
- Convert preprocessed tweets to tensors:
  - After preprocessing data, we tokenize tweets into tensors
  - These tensors may have padding based on the length of the longest tweet within the batch
  - Usually, the tensor values are each word's integer index in the vocabulary
- Build supervised model with embedding layer
  - A typical sentiment analysis model includes an initial word embedding layer
    - An embedding layer is a trainable layer with weights
    - Embedding layers are commonly used to map each discrete token index in a tweet to a smaller, dense vector
    - The values in word embeddings usually represent some learned contextual value
    - These values become the weights in the embedding layer
  - The output value can be:
    - A binary value representing positive or negative sentiment
    - A decimal between $0$ and $1$ representing the probability of positive sentiment
    - The output layer could be a sigmoid or softmax layer
- Train the model
  - Train the model using:
    - Optimizers like Adam
    - Batching techniques like bucketing
    - etc.
- Test the model
  - Predict sentiments using new data
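The tensor-conversion and embedding steps above can be sketched in plain Python. This is only an illustration under assumptions: the small `vocab`, the helper names, and the random (untrained) embedding table are all hypothetical, and a real model would learn the embedding weights during training.

```python
import random

PAD, UNK = "__PAD__", "__UNK__"


def tweet_to_ids(tokens, vocab):
    """Replace each token with its vocabulary index (unknown words map to UNK)."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


def pad_batch(batch_ids, pad_id=0):
    """Right-pad every tweet to the length of the longest tweet in the batch."""
    max_len = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]


def embed_and_average(ids, embedding, pad_id=0):
    """Look up one vector per non-padding token, then average them per dimension."""
    vectors = [embedding[i] for i in ids if i != pad_id]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical vocabulary and already-preprocessed batch of tweets.
vocab = {PAD: 0, UNK: 1, "i": 2, "am": 3, "happy": 4, "sad": 5}
batch = [["i", "am", "happy"], ["sad"], ["i", "am", "so", "sad"]]

padded = pad_batch([tweet_to_ids(tokens, vocab) for tokens in batch])
print(padded)

# Untrained embedding table: one random 4-dimensional vector per vocabulary entry.
random.seed(0)
embedding = [[random.uniform(-1, 1) for _ in range(4)] for _ in vocab]
tweet_vector = embed_and_average(padded[0], embedding)
print(len(tweet_vector))
```

Skipping padding positions in the average is a choice made for this sketch; a mean layer applied directly to the padded tensor would include the padding vectors as well.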
General Implementation of Sentiment Analysis
import numpy as np
import trax
from trax import layers as tl
from trax.supervised import training

# 1a. Load the data
pos_tweets, neg_tweets = load_tweets()
# 1b. Split data into training set
train_pos = pos_tweets[:4000]
train_neg = neg_tweets[:4000]
train_x = train_pos + train_neg
# 1c. Split data into test set
test_pos = pos_tweets[4000:]
test_neg = neg_tweets[4000:]
test_x = test_pos + test_neg
# 1d. Create labels for training and test sets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
# 2. Preprocess data (preprocess_tweets is a placeholder helper)
train_x, test_x = preprocess_tweets(train_x, test_x)
# 3. Create Vocabulary
vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}
for tweet in train_x:
    processed_tweet = process_tweet(tweet)
    for word in processed_tweet:
        if word not in vocab:
            vocab[word] = len(vocab)
# 4a. Initialize tokenizer for tweets
# (this pipeline uses the pre-built en_8k.subword vocabulary rather than vocab above)
data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_file='en_8k.subword', keys=[0]),
    trax.data.Shuffle(),
    trax.data.FilterByLength(max_length=30, length_keys=[0]),
    trax.data.BucketByLength(boundaries=[5, 10, 30], batch_sizes=[1, 5, 10], length_keys=[0]),
    trax.data.AddLossWeights()
)
# 4b. Tokenize and batch tweets, then output as generators
# (train_stream and eval_stream are assumed to be generators that yield (tweet, label) pairs)
streamed_batches = data_pipeline(train_stream)
streamed_eval_batches = data_pipeline(eval_stream)
# 5. Initialize model
model = tl.Serial(
    tl.Embedding(vocab_size=8192, d_feature=256),
    tl.Mean(axis=1),  # Average on axis 1 (length of sentence)
    tl.Dense(2),      # Classify 2 classes
    tl.LogSoftmax()   # Produce log-probabilities
)
# 6a. Initialize training task
train_task = training.TrainTask(
    labeled_data=streamed_batches,
    loss_layer=tl.WeightedCategoryCrossEntropy(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=500,
)
# 6b. Initialize evaluation task
eval_task = training.EvalTask(
    labeled_data=streamed_eval_batches,
    metrics=[tl.WeightedCategoryCrossEntropy(), tl.WeightedCategoryAccuracy()],
    n_eval_batches=20  # For less variance in eval numbers
)
# 6c. Prepare training loop and save checkpoints to output_dir
training_loop = training.Loop(
    model,
    train_task,
    eval_tasks=[eval_task],
    output_dir='~/output_dir/'
)
# 6d. Run 2000 steps (batches)
training_loop.run(2000)
# 7. Predict sentiment on test data
example_input = next(streamed_eval_batches)[0][0]
example_input_str = trax.data.detokenize(example_input, vocab_file='en_8k.subword')
sentiment_log_probs = model(example_input[None, :])  # Add batch dimension
print(f'example input_str: {example_input_str}')
print(f'Model returned sentiment probabilities: {np.exp(sentiment_log_probs)}')