Describing TF-IDF
- Tf-idf stands for term frequency-inverse document frequency
- Tf-idf is a measure of how important a given word is to a given document within a collection of documents, or corpus
- Generally, the importance increases as the frequency of the word in its document increases, but is offset by the frequency of the word in the corpus
The Algorithm
The tf-idf weight is composed of two terms:
TF:
A weight representing the number of times a word appears in a document, divided by the total number of words in that document
IDF:
A weight representing the logarithm of the total number of documents in the corpus divided by the number of documents where the specific term appears
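The two weights described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the whitespace tokenization, and the choice of a base-10 logarithm are all assumptions, since the text only says "the logarithm".

```python
import math
from collections import Counter


def term_frequency(term, document_tokens):
    """Occurrences of `term` divided by the total number of words in the document."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """Logarithm of (total documents / documents that contain `term`).

    Base 10 is an assumption; the text does not name a base.
    """
    containing = sum(1 for doc in corpus_tokens if term in doc)
    if containing == 0:
        # The ratio is undefined when no document contains the term.
        raise ValueError(f"term {term!r} does not appear in the corpus")
    return math.log10(len(corpus_tokens) / containing)


def tf_idf(term, document_tokens, corpus_tokens):
    """The tf-idf weight: the product of the two weights above."""
    return term_frequency(term, document_tokens) * inverse_document_frequency(
        term, corpus_tokens
    )


# Hypothetical toy corpus, tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "a bird sang in the tree".split(),
]
weight = tf_idf("cat", corpus[0], corpus)
```

Raising an error for a term that appears in no document is one possible design choice; some implementations instead add a smoothing constant to the denominator.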
Defining Term Frequency
- In words, term frequency measures how frequently a term occurs in a document
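Written as a formula, this definition might look as follows. The symbols t (a term) and d (a document) are notation chosen here for illustration.

```latex
\mathrm{tf}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of words in } d}
```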
Defining Inverse Document Frequency
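- In words, inverse document frequency measures how important a term is: a term that appears in many documents of the corpus receives a low weight, while a term that appears in few documents receives a high weight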
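Written as a formula, this definition might look as follows. The symbols are notation chosen here for illustration: N is the total number of documents in the corpus and n_t is the number of documents that contain the term t. The base of the logarithm is left open, as in the text.

```latex
\mathrm{idf}(t) = \log \frac{N}{n_t}
```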
An Example of TF-IDF
- We want to know how important the term cat is in a document, relative to the entire collection of documents
- Therefore, we want to know the tf-idf weight
The importance of the term cat in our document, or the term frequency, is the following:
- Since of the words in our document is the term cat
The importance of the term cat relative to the entire collection of our documents, or the inverse document frequency, is the following:
- Since of the documents from our collection of documents contains the term cat
- Thus, the tf-idf weight is the product of these quantities:
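To make the arithmetic of an example like this concrete, here is a small sketch. All counts below are illustrative assumptions, not values taken from the example above, and the base-10 logarithm is likewise an assumption.

```python
import math

# Hypothetical counts, chosen only to illustrate the calculation.
words_in_document = 100            # total words in our document
cat_occurrences = 3                # times "cat" appears in our document
documents_in_corpus = 10_000_000   # total documents in the collection
documents_with_cat = 1_000         # documents that contain "cat"

# Term frequency: share of the document's words that are "cat".
tf = cat_occurrences / words_in_document

# Inverse document frequency: log of (all documents / documents with "cat").
idf = math.log10(documents_in_corpus / documents_with_cat)

# The tf-idf weight is the product of the two quantities.
tf_idf_weight = tf * idf

print(f"tf = {tf:.2f}, idf = {idf:.2f}, tf-idf = {tf_idf_weight:.2f}")
```

Under these assumed counts, the term frequency is 3/100 = 0.03, the inverse document frequency is log10(10,000) = 4, and the tf-idf weight is 0.03 × 4 = 0.12.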