What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic that indicates how important a word is to a document relative to a collection of documents, often called a corpus. It is commonly used in information retrieval and text mining.

Here’s a breakdown:

  1. Term Frequency (TF):
  • This represents the frequency of a term in a particular document.
  • \(\text{TF}(t, d)\) = number of times term \(t\) appears in document \(d\), divided by the total number of terms in document \(d\).
  • Formula:
    \[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
  2. Inverse Document Frequency (IDF):
  • This reflects how rare, and therefore how informative, the term is across the entire corpus.
  • The idea is that terms that appear in many documents are less informative than those that appear in a smaller number of documents.
  • Formula:
    \[ \text{IDF}(t, D) = \log \frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t} \]
  3. TF-IDF Score:
  • It’s the product of TF and IDF.
  • Formula:
    \[ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \]
  • This score will be high for a term that appears often in a particular document but not in many documents in the corpus, implying the term is important for that specific document.
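
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `tf`, `idf`, and `tf_idf` and the toy corpus are assumptions made for the example, documents are assumed to be pre-tokenized lists of words, and the natural logarithm is assumed for IDF (the formula does not fix a base).

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Share of the tokens in one document that are `term`."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(total documents / documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise the
    division below would fail.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    """Product of the two quantities above."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

print(tf_idf("the", corpus[0], corpus))  # 0.0, since "the" is in every document
print(tf_idf("mat", corpus[0], corpus))  # (1/6) * ln(3), about 0.183
```

Note how "the" scores zero even though it is the most frequent word in the first document, while "mat", which appears once but in only one document, gets the higher score.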

Why is it useful?

  • Discrimination between words: Common words (like “and”, “the”, “is”) appear in almost every document, so their IDF is close to zero (exactly zero when a term appears in every document, since log 1 = 0). That keeps their overall TF-IDF score low and down-weights these frequent, less informative terms.
  • Highlighting unique terms: If a term appears often in a particular document but not in many other documents, it’ll have a high TF-IDF score, which is helpful for tasks like document summarization and keyword extraction.
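
To make the keyword-extraction idea concrete, here is a small sketch that ranks the terms of one document by TF-IDF and keeps the top few. The helper name `top_keywords`, its `k` parameter, and the sample sentences are all assumptions for illustration; it also assumes the document being scored is itself part of the corpus, so every term has a document frequency of at least one.

```python
import math
from collections import Counter

def top_keywords(doc_tokens, corpus, k=3):
    """Return the k terms of `doc_tokens` with the highest TF-IDF score.

    Hypothetical helper: assumes `doc_tokens` is one of the documents in
    `corpus`, so no term has a document frequency of zero.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        scores[term] = (count / len(doc_tokens)) * math.log(n_docs / doc_freq)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Made-up sample documents, already tokenized.
corpus = [
    "the rocket engine burns fuel and the rocket climbs".split(),
    "the chef burns the toast and the eggs".split(),
    "the engine is loud and the car is old".split(),
]

print(top_keywords(corpus[0], corpus))  # ['rocket', 'climbs', 'fuel']
```

Words shared by every document ("the", "and") score zero and drop out, while "rocket", which is frequent in the first document and absent elsewhere, ranks first.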

Applications:

TF-IDF is widely used in:

  • Information Retrieval: For example, in search engines to rank documents based on a query.
  • Text Mining: For extracting keywords or significant terms from a document.
  • Document Clustering and Classification: To represent documents as vectors in a high-dimensional space for machine learning tasks.
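
As a rough sketch of the retrieval and vector-representation uses, the example below turns each document into a sparse TF-IDF vector (a dict from term to weight) and ranks documents by cosine similarity to a query. Cosine similarity is one common choice of ranking function and is an assumption here, as are the helper names `build_idf`, `vectorize`, `cosine`, and `rank` and the toy corpus. Query terms that never occur in the corpus are simply ignored.

```python
import math
from collections import Counter

def build_idf(corpus):
    """Map each term in the corpus to log(N / document frequency)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def vectorize(tokens, idf):
    """Sparse TF-IDF vector; terms unknown to the corpus are skipped."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_tokens, corpus):
    """Return document indices ordered from most to least similar to the query."""
    idf = build_idf(corpus)
    doc_vectors = [vectorize(doc, idf) for doc in corpus]
    query_vector = vectorize(query_tokens, idf)
    scores = [cosine(query_vector, dv) for dv in doc_vectors]
    return sorted(range(len(corpus)), key=lambda i: -scores[i])

# Hypothetical toy corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang a song".split(),
]

print(rank("dog cat".split(), corpus))  # [1, 0, 2]
```

The second document matches both query terms and ranks first; the first document matches only "cat", which carries less weight because it appears in two of the three documents; the third shares no query terms and comes last. The same vectors could be fed to a clustering or classification algorithm instead of a ranking function.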