Motivating Latent Dirichlet Allocation
- A topic refers to some artificial group or category that the words could belong to
- A token refers to an individual word within a document
- A document refers to some individual text
- A corpus refers to a collection of texts
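In code, these terms map onto a simple nested structure; a made-up example:

```python
# A corpus is a collection of documents; each document is a list of tokens
corpus = [
    ["cats", "dogs", "pets"],          # document 0
    ["stocks", "markets", "invest"],   # document 1
]
token = corpus[0][1]  # an individual word occurrence: "dogs"
# A topic is a latent category (e.g. "pets" vs. "finance") that the model infers
```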
Describing the LDA Model
- In natural language processing, latent Dirichlet allocation (or LDA) is a probabilistic generative model grounded in Bayesian statistics
- The model typically takes in a hyperparameter representing our desired number of topics
- The model uses documents (or any individual pieces of text) as input
The model outputs two matrices representing:
- The probability that each document belongs to each topic (the document-topic matrix)
- The probability that each word belongs to each topic (the word-topic matrix)
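For intuition, the outputs for a made-up corpus of three documents, two topics, and a four-word vocabulary might look like this (all numbers are invented):

```python
import numpy as np

# P(topic | document): one row per document, each row sums to 1 (made-up values)
doc_topic_probs = np.array([
    [0.9, 0.1],   # document 0 is mostly topic 0
    [0.2, 0.8],   # document 1 is mostly topic 1
    [0.5, 0.5],   # document 2 mixes both topics
])

# P(word | topic): one row per topic, each row sums to 1 (made-up values)
word_topic_probs = np.array([
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])
```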
The idea of the model goes like this:
- We collect a bunch of observations, which are words (collected into documents)
- Documents are thought to belong to some underlying topic to a certain degree
- Topics are formed by grouping similar words from the collection of documents
- We measure a document's degree of belonging to a topic by examining its words and determining which topic those words most closely align to
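To make this generative story concrete, here is a minimal numpy sketch of how LDA assumes a single document is produced; the vocabulary, topic-word probabilities, and hyperparameter value are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "pet", "stock", "market", "invest"]   # made-up vocabulary
num_topics, doc_length, alpha = 2, 8, 0.5

# Each topic is a distribution over the vocabulary (made-up values)
topic_word = np.array([
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],   # a "pets" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "finance" topic
])

# 1. Draw the document's mixture over topics from a Dirichlet prior
theta = rng.dirichlet([alpha] * num_topics)

# 2. For each word slot, pick a topic, then pick a word from that topic
words = []
for _ in range(doc_length):
    z = rng.choice(num_topics, p=theta)                        # topic for this token
    words.append(vocab[rng.choice(len(vocab), p=topic_word[z])])

print(theta)   # the document's degree of belonging to each topic
print(words)   # the generated document
```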
Approaching an LDA Problem
The goal of LDA is to estimate the two hidden parameters, commonly denoted θ and φ
- Where θ represents the document-topic distribution
- Where φ represents the word-topic distribution
- The posterior distribution over these parameters is intractable, since the required integrals cannot be computed analytically in a high-dimensional space (written out below)
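For reference, the intractable quantity can be written as follows, here for a single document of N words, where w denotes the observed words, z the topic assignments, and α, β the Dirichlet hyperparameters (notation varies across sources):

```latex
p(\theta, \varphi, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  = \frac{p(\theta, \varphi, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}
         {p(\mathbf{w} \mid \alpha, \beta)},
\qquad
p(\mathbf{w} \mid \alpha, \beta)
  = \iint p(\theta \mid \alpha)\, p(\varphi \mid \beta)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \varphi)
    \; d\theta \, d\varphi
```

The coupling between θ and φ inside the product is what makes the marginal likelihood analytically intractable, which is why approximate inference is needed.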
Therefore, most implementations rely on one of two families of approximate inference methods:
Variational Inference
- This seems to be the preferred method for getting fast and accurate results in most software implementations
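As a concrete illustration, here is a minimal sketch using scikit-learn's LatentDirichletAllocation, which implements online variational Bayes; the toy corpus and parameter values are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: each string is one document
docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the yard",
    "stocks and bonds are common investments",
    "investors buy stocks when markets fall",
]

# Bag-of-words counts (documents x vocabulary)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA with 2 topics via online variational Bayes
model = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = model.fit_transform(X)     # shape: (n_documents, n_topics)
topic_words = model.components_         # shape: (n_topics, n_vocabulary), unnormalized

print(doc_topics)                                              # rows ~ P(topic | document)
print(topic_words / topic_words.sum(axis=1, keepdims=True))    # rows ~ P(word | topic)
```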
MCMC Methods
- These have the benefit of being asymptotically unbiased and easy to understand
- However, MCMC methods tend to be computationally inefficient when working with large corpora
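For comparison, a sketch of the same task using the third-party lda package, which implements collapsed Gibbs sampling (this assumes the package is installed, e.g. via pip install lda, and uses its documented doc_topic_ and topic_word_ attributes; the count matrix is made up):

```python
import numpy as np
import lda

# Toy document-term count matrix (4 documents x 6 vocabulary words)
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 1, 0, 0],
    [0, 0, 1, 0, 3, 2],
    [0, 1, 0, 0, 2, 3],
])

# Collapsed Gibbs sampling with 2 topics
model = lda.LDA(n_topics=2, n_iter=500, random_state=0)
model.fit(X)

print(model.doc_topic_)    # (n_documents, n_topics): P(topic | document)
print(model.topic_word_)   # (n_topics, n_vocabulary): P(word | topic)
```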
The LDA Algorithm
- Create a list of lists, where each inner list represents a document (or individual text entry) and each entry within that inner list is an individual word from that document
- Preprocess by removing stop words, removing punctuation, and lemmatizing words
- Encode the vocabulary, for example by one-hot encoding each word or mapping each distinct word to an integer ID
- Randomly assign a topic to each word to initialize the topic assignments (a sketch of these setup steps follows below)
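A minimal sketch of these setup steps on a tiny made-up corpus (a real pipeline would use a proper stop-word list and a lemmatizer such as NLTK's or spaCy's):

```python
import random
import string

# Toy corpus: a minimal, made-up example
corpus = [
    "Cats and dogs are popular pets.",
    "Investors buy stocks when markets fall.",
]
num_topics = 2
stop_words = {"and", "are", "when", "the", "a"}   # tiny hand-picked list for illustration

# Steps 1-2: tokenize, lowercase, strip punctuation, drop stop words
documents = []
for text in corpus:
    tokens = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    documents.append([t for t in tokens if t not in stop_words])

# Step 3: build a vocabulary mapping each distinct word to an integer ID
vocab = {word: i for i, word in enumerate(sorted({w for doc in documents for w in doc}))}

# Step 4: randomly assign an initial topic to every token
random.seed(0)
assignments = [[random.randrange(num_topics) for _ in doc] for doc in documents]
```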
Generate a word-topic count matrix
- Word-topic matrices represent an n by m dimensional matrix, where n is the number of distinct topics, m is the number of distinct words, and each entry counts how many times a given word has been assigned to a given topic
- In other words, each row represents a conditional frequency distribution of words given a topic
- If we sum up some column, we should receive the frequency of a word across all documents
- If we sum up some row, we should receive the total number of words assigned to a topic (see the sketch below)
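Continuing the running sketch above (reusing documents, vocab, assignments, and num_topics), the word-topic count matrix can be built like this:

```python
import numpy as np

# Word-topic count matrix: rows are topics, columns are vocabulary words
word_topic = np.zeros((num_topics, len(vocab)), dtype=int)
for doc, topics in zip(documents, assignments):
    for word, topic in zip(doc, topics):
        word_topic[topic, vocab[word]] += 1

print(word_topic.sum(axis=0))   # column sums: frequency of each word across all documents
print(word_topic.sum(axis=1))   # row sums: number of words assigned to each topic
```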
Generate a document-topic count matrix
Document-topic matrices represent a d by n dimensional matrix
- Where d is the number of distinct documents
- Where n is the number of distinct topics
- And each entry counts how many of a document's words are assigned to a given topic
- In other words, each row represents a conditional frequency distribution of topics given a document
- If we sum up some column, we should receive the total number of words assigned to a topic across all documents
- If we sum up some row, we should receive the total number of words in a specific document (see the sketch below)
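And the document-topic count matrix, again continuing the running sketch:

```python
import numpy as np

# Document-topic count matrix: rows are documents, columns are topics
doc_topic = np.zeros((len(documents), num_topics), dtype=int)
for d, topics in enumerate(assignments):
    for topic in topics:
        doc_topic[d, topic] += 1

print(doc_topic.sum(axis=0))   # column sums: number of words assigned to each topic
print(doc_topic.sum(axis=1))   # row sums: total number of words in each document
```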
Assign better topics to words (instead of our initial, randomly-assigned topics)
- We do this with Gibbs sampling: at each iteration, we take a token, remove its current topic assignment, and resample a new topic with probability proportional to how often that topic appears in the token's document and how often the token's word is assigned to that topic across all documents
- Recompute the word-topic matrix
- Recompute the document-topic matrix, and repeat these updates for many iterations until the assignments stabilize
- We can now calculate the probability that a document belongs to each topic by counting how many of the document's words are assigned to that topic and dividing by the total number of words in the document
- We can also calculate the probability of a word belonging to each topic by counting how many times that word is assigned to the topic and dividing by the total number of words assigned to that topic (see the sketch below)
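A minimal sketch of the collapsed Gibbs sampling updates and the final probability calculations, continuing the running sketch above; the alpha and beta smoothing hyperparameters here are assumed values, not taken from the text:

```python
import numpy as np

alpha, beta = 0.1, 0.01          # assumed Dirichlet smoothing hyperparameters
V = len(vocab)
rng = np.random.default_rng(0)

for _ in range(200):                              # Gibbs sweeps over the corpus
    for d, (doc, topics) in enumerate(zip(documents, assignments)):
        for i, word in enumerate(doc):
            w, k = vocab[word], topics[i]

            # Remove this token's current assignment from the counts
            word_topic[k, w] -= 1
            doc_topic[d, k] -= 1

            # Probability of each topic for this token, given all other assignments
            p = (doc_topic[d] + alpha) * (word_topic[:, w] + beta) \
                / (word_topic.sum(axis=1) + V * beta)
            k = rng.choice(num_topics, p=p / p.sum())

            # Record the new assignment and update the counts
            topics[i] = k
            word_topic[k, w] += 1
            doc_topic[d, k] += 1

# Normalize the count matrices into the two probability matrices described above
p_topic_given_doc = doc_topic / doc_topic.sum(axis=1, keepdims=True)      # P(topic | document)
p_word_given_topic = word_topic / word_topic.sum(axis=1, keepdims=True)   # P(word | topic)
print(p_topic_given_doc)
print(p_word_given_topic)
```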