Latent Dirichlet Allocation

Motivating Latent Dirichlet Allocation

  • A topic refers to some artificial group or category that words can belong to
  • A token refers to an individual word occurrence
  • A document refers to an individual piece of text
  • A corpus refers to a collection of documents

Describing the LDA Model

  • In natural language processing, latent Dirichlet allocation (or LDA) is a generative probabilistic model grounded in Bayesian statistics
  • The model typically takes in a hyperparameter representing our desired number of topics
  • The model uses documents (or any individual pieces of text) as input
  • The model outputs matrices representing:

    1. The probability that a document belongs to a topic
    2. The probability that a word belongs to a topic
  • The idea of the model goes like this:

    • We collect a bunch of observations, which are words (collected into documents)
    • Documents are thought to belong to some underlying topic to a certain degree
    • Topics are formed by grouping similar words from the collection of documents
    • We measure a document's degree of belonging to a topic by examining its words and determining which topic those words most closely align to
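
For orientation, here is a minimal sketch of fitting LDA with the gensim library (the notes do not name a specific implementation, so gensim is an assumption). The toy documents reuse the preprocessed example from the walkthrough below, and the two printed outputs correspond to the two matrices described above.

```python
# A minimal sketch using gensim (an assumed choice of library); the toy
# documents come from the worked example later in these notes.
from gensim import corpora
from gensim.models import LdaModel

documents = [
    ["eat", "turkey", "holiday", "always"],
    ["eat", "cake", "holiday"],
    ["turkey", "eat", "chicken"],
]

dictionary = corpora.Dictionary(documents)                   # word <-> integer id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words counts per document

# num_topics is the hyperparameter for the desired number of topics
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, random_state=0)

print(lda.get_document_topics(bow_corpus[0]))  # document-topic probabilities for document 0
print(lda.get_topics())                        # topic-word probability matrix (one row per topic)
```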

Approaching an LDA Problem

  • The goal of LDA is to compute the two hidden parameters commonly denoted θ and ϕ

    • Where θ represents the document-topic distribution
    • Where ϕ represents the word-topic distribution
  • The posterior distribution over these parameters is intractable, since we cannot compute the required integrals analytically in such a high-dimensional space
  • Therefore, most implementations rely on one of two approximation methods:

    1. Variational Inference

      • This seems to be the preferred method for getting fast and accurate results in most software implementations
    2. MCMC Methods

      • These have the benefit of being unbiased and easy to understand
      • However, MCMC methods tend to be computationally inefficient when working with large corpora

The LDA Algorithm

  1. Create a list of lists, where each list represents some document (or individual text entry) and each entry within that list represents an individual word from that document
[["Eat", "turkey", "on", "holidays", "always"], ["I", "eat", "cake", "on", "holidays"], ["Turkey", "eat", "chickens"]]
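
The snippets in this walkthrough form one continuous Python sketch. To start, a plain whitespace split can stand in for a real tokenizer; the raw sentences are reconstructed from the example above.

```python
# Raw documents reconstructed from the example above; str.split() is a
# stand-in for a proper tokenizer.
raw_documents = [
    "Eat turkey on holidays always",
    "I eat cake on holidays",
    "Turkey eat chickens",
]
tokenized = [doc.split() for doc in raw_documents]
# [['Eat', 'turkey', 'on', 'holidays', 'always'], ['I', 'eat', 'cake', 'on', 'holidays'], ['Turkey', 'eat', 'chickens']]
```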
  2. Preprocess by removing stop words, removing punctuation, and lemmatizing words
[["eat", "turkey", "holiday", "always"], ["eat", "cake", "holiday"], ["turkey", "eat", "chicken"]]
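
One way to do this preprocessing, continuing the sketch, is with NLTK's stop-word list and WordNet lemmatizer (this assumes the stopwords and wordnet corpora have been downloaded via nltk.download); any comparable tools would work.

```python
# Continuing from `tokenized`; assumes nltk.download("stopwords") and
# nltk.download("wordnet") have been run beforehand.
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(tokens):
    cleaned = []
    for token in tokens:
        token = token.lower().strip(string.punctuation)  # lowercase and drop punctuation
        if token and token not in stop_words:            # drop stop words
            cleaned.append(lemmatizer.lemmatize(token))  # lemmatize ("holidays" -> "holiday")
    return cleaned

processed = [preprocess(doc) for doc in tokenized]
# [['eat', 'turkey', 'holiday', 'always'], ['eat', 'cake', 'holiday'], ['turkey', 'eat', 'chicken']]
```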
  3. Encode each distinct word as an integer ID (build a vocabulary index)
[[1, 2, 3, 4], [1, 5, 3], [2, 1, 6]]
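
Continuing the sketch, the vocabulary can be built on the fly, assigning each new word the next integer ID (starting at 1 to mirror the example above).

```python
# Map each distinct word to an integer id and encode every document as a list of ids.
vocab = {}
encoded = []
for doc in processed:
    encoded_doc = []
    for word in doc:
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # ids start at 1, as in the example
        encoded_doc.append(vocab[word])
    encoded.append(encoded_doc)
# encoded -> [[1, 2, 3, 4], [1, 5, 3], [2, 1, 6]]
```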
  4. Randomly assign a topic to each word to initialize a word-topic matrix
[[1, 1, 2, 2], [1, 2, 1], [2, 2, 1]] where the number of topics is 2
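
A random initialization, continuing the sketch; to keep the later numbers consistent with the worked example, the specific draw shown above is then reused.

```python
import random

num_topics = 2
# Randomly assign a topic (1 or 2) to every token
assignments = [[random.randint(1, num_topics) for _ in doc] for doc in encoded]
# For the rest of the walkthrough, use the particular draw shown in the example above
assignments = [[1, 1, 2, 2], [1, 2, 1], [2, 2, 1]]
```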
  5. Generate a word-topic count matrix, as sketched below

    • A word-topic count matrix is an n × m matrix, where n is the number of distinct topics, m is the number of distinct words, and each entry counts how many times a given word has been assigned to a given topic
    • In other words, each row represents a conditional frequency distribution of words given a topic
    • If we sum up some column, we should receive the frequency of a word across all documents
    • If we sum up some row, we should receive the total number of tokens assigned to a topic
[[2, 1, 1, 0, 0, 1], [1, 1, 1, 1, 1, 0]] where the number of topics is 2 (2 rows) and the number of distinct words is 6 (6 columns)
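
Continuing the sketch, this count matrix can be built by walking over every token together with its assigned topic.

```python
# Count how many times each word is assigned to each topic.
num_words = len(vocab)
word_topic = [[0] * num_words for _ in range(num_topics)]
for doc, topics in zip(encoded, assignments):
    for word_id, topic in zip(doc, topics):
        word_topic[topic - 1][word_id - 1] += 1
# word_topic -> [[2, 1, 1, 0, 0, 1], [1, 1, 1, 1, 1, 0]]
```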
  6. Generate a document-topic count matrix, as sketched below

    • A document-topic count matrix is an n × m matrix

      • Where n is the number of distinct documents
      • Where m is the number of distinct topics
      • And each entry counts how many of a document's tokens are assigned to each topic
    • In other words, each row represents a conditional frequency distribution of topics given a document
    • If we sum up some column, we should receive the frequency of a topic
    • If we sum up some row, we should receive the number of words in a specific document
[[2, 2], [2, 1], [1, 2]] where the number of documents is 3 (3 rows) and the number of topics is 2 (2 columns)
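
The analogous count over topics per document, continuing the sketch:

```python
# Count how many tokens in each document are assigned to each topic.
doc_topic = [[0] * num_topics for _ in range(len(encoded))]
for d, topics in enumerate(assignments):
    for topic in topics:
        doc_topic[d][topic - 1] += 1
# doc_topic -> [[2, 2], [2, 1], [1, 2]]
```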
  7. Assign better topics to words (instead of our initial, randomly-assigned topics)

    • We do this with collapsed Gibbs sampling: for each token, we temporarily remove its current topic assignment from the counts, compute the probability of each topic from how prevalent that topic is in the token's document and how often that token's word is assigned to that topic across all documents, and then resample a new topic from that distribution (see the sketch after this step's example)
[[2, 1, 1, 2], [2, 1, 1], [1, 2, 2]]
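
A single sweep of collapsed Gibbs sampling over the count matrices might look like the sketch below; alpha and beta are the usual Dirichlet smoothing hyperparameters, with values chosen only for illustration. Because the counts are updated in place, the recomputed matrices of the next two steps fall out of the same loop.

```python
# One sweep of collapsed Gibbs sampling (a sketch). alpha and beta are
# illustrative Dirichlet hyperparameters, not values from the notes.
alpha, beta = 0.1, 0.1
topic_totals = [sum(row) for row in word_topic]  # tokens currently assigned to each topic

for d, doc in enumerate(encoded):
    for i, word_id in enumerate(doc):
        w, k = word_id - 1, assignments[d][i] - 1
        # Remove this token's current assignment from all counts
        word_topic[k][w] -= 1
        doc_topic[d][k] -= 1
        topic_totals[k] -= 1
        # Probability of each topic given the remaining assignments:
        # (topic prevalence in this document) * (topic's affinity for this word)
        weights = [
            (doc_topic[d][t] + alpha)
            * (word_topic[t][w] + beta) / (topic_totals[t] + num_words * beta)
            for t in range(num_topics)
        ]
        new_topic = random.choices(range(num_topics), weights=weights)[0]
        # Add the token back under its (possibly new) topic
        assignments[d][i] = new_topic + 1
        word_topic[new_topic][w] += 1
        doc_topic[d][new_topic] += 1
        topic_totals[new_topic] += 1
# Repeated sweeps move the assignments toward a sample such as [[2, 1, 1, 2], [2, 1, 1], [1, 2, 2]]
```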
  8. Recompute the word-topic matrix
[[0, 2, 2, 0, 1, 0], [3, 0, 0, 1, 0, 1]]
  9. Recompute the document-topic matrix
[[2, 2], [2, 1], [1, 2]]

10a. We can now calculate the probability that a document belongs to each topic by counting how many of the document's words are assigned to a certain topic and dividing that count by the total number of words within the document

[[0.5, 0.5], [0.66, 0.33], [0.33, 0.66]]
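
Continuing the sketch, this is just a row-wise normalization of the document-topic count matrix (the θ estimate).

```python
# Normalize each row of the document-topic counts into probabilities (theta).
theta = [[count / sum(row) for count in row] for row in doc_topic]
# With counts [[2, 2], [2, 1], [1, 2]] this gives
# [[0.5, 0.5], [0.667, 0.333], [0.333, 0.667]] (rounded to 3 decimal places)
```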

10b. We can also calculate the probability of a word belonging to each topic by counting how often that word is assigned to a certain topic and dividing that count by the total number of words assigned to that topic

[[0, 0.4, 0.4, 0, 0.2, 0], [0.6, 0, 0, 0.2, 0, 0.2]]
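
Likewise, this is a row-wise normalization of the word-topic count matrix (the ϕ estimate).

```python
# Normalize each row of the word-topic counts into probabilities (phi).
phi = [[count / sum(row) for count in row] for row in word_topic]
# With counts [[0, 2, 2, 0, 1, 0], [3, 0, 0, 1, 0, 1]] this gives
# [[0.0, 0.4, 0.4, 0.0, 0.2, 0.0], [0.6, 0.0, 0.0, 0.2, 0.0, 0.2]]
```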
