TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic that indicates how important a word is to a document relative to a collection of documents, often called a corpus. TF-IDF is commonly used in information retrieval and text mining.
Here’s a breakdown:
- Term Frequency (TF):
  - This represents the frequency of a term in a particular document.
  - \( \text{TF}(t, d) \) is the number of times term \( t \) appears in document \( d \), divided by the total number of terms in document \( d \).
  - Formula:
    \[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
- Inverse Document Frequency (IDF):
  - This reflects how rare the term is across the entire corpus.
  - The idea is that terms that appear in many documents are less informative than those that appear in only a few.
  - Formula:
    \[ \text{IDF}(t, D) = \log \frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t} \]
- TF-IDF Score:
  - It’s the product of TF and IDF.
  - Formula:
    \[ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \]
  - The score is high for a term that appears often in a particular document but in few documents across the corpus, which suggests the term is important to that specific document.
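
The three formulas above can be sketched directly in plain Python. This is a minimal illustration, not a reference implementation: the function names, the whitespace tokenization, and the toy corpus are all assumptions made for the example, and returning an IDF of 0 for a term that appears in no document is one possible convention for avoiding division by zero.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """TF(t, d): occurrences of `term` divided by the number of tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """IDF(t, D): log(number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus_tokens if term in doc)
    if containing == 0:
        # Assumption: a term absent from the whole corpus gets IDF 0
        # instead of raising a division-by-zero error.
        return 0.0
    return math.log(len(corpus_tokens) / containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)


# Hypothetical toy corpus, tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

print(tf_idf("the", corpus[0], corpus))  # 0.0, since "the" occurs in every document
print(tf_idf("cat", corpus[0], corpus))  # small: "cat" occurs in two of three documents
print(tf_idf("mat", corpus[0], corpus))  # larger: "mat" occurs in only one document
```

The natural logarithm is used here; a different base only rescales every score by the same constant, so the relative ordering of terms is unchanged.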
Why is it useful?
- Discrimination between words: Common words (like “and”, “the”, “is”) appear in many documents, so their IDF score would be close to zero, making their overall TF-IDF score low. This helps in down-weighting these frequent, less informative terms.
- Highlighting unique terms: If a term appears often in a particular document but not in many other documents, it’ll have a high TF-IDF score, which is helpful for tasks like document summarization and keyword extraction.
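
To make both effects concrete, here is a small keyword-extraction sketch. The helper `tfidf_scores` and the three sentences are hypothetical, and the code assumes the document being scored is itself part of the corpus, so every term has a document frequency of at least one.

```python
import math
from collections import Counter


def tfidf_scores(doc, corpus):
    """Return {term: TF-IDF score} for every term in `doc`.

    Assumes `doc` is one of the token lists in `corpus`, so each term
    appears in at least one document and the division is safe.
    """
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        doc_freq = sum(1 for d in corpus if term in d)
        scores[term] = (count / len(doc)) * math.log(n_docs / doc_freq)
    return scores


# Hypothetical corpus, tokenized by splitting on whitespace.
corpus = [
    "the reactor core is the hot part of the plant".split(),
    "the plant is near the river".split(),
    "the river is cold and the water is clear".split(),
]

scores = tfidf_scores(corpus[0], corpus)

# Terms ordered from most to least distinctive for the first document.
keywords = sorted(scores, key=scores.get, reverse=True)
for term in keywords:
    print(f"{term:8s} {scores[term]:.4f}")
```

In this toy corpus "the" and "is" occur in every document and score exactly zero, "plant" occurs in two documents and scores low, and terms found only in the first document score highest. With so few documents, an ordinary word like "of" also looks distinctive; a larger corpus would push it down.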
Applications:
TF-IDF is widely used in tasks such as:
- Information Retrieval: For example, in search engines to rank documents based on a query.
- Text Mining: For extracting keywords or significant terms from a document.
- Document Clustering and Classification: To represent documents as vectors in a high-dimensional space for machine learning tasks.
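
As a rough sketch of the retrieval and vector-representation uses, the code below turns each document into a TF-IDF vector and ranks documents against a query by cosine similarity. Cosine similarity is a common choice for comparing such vectors but is an assumption here, as are the function names, the toy corpus, and the requirement that documents and the query are non-empty; real search engines use more elaborate scoring.

```python
import math
from collections import Counter


def build_vectors(corpus):
    """Return (vocab, idf, vectors) with one TF-IDF vector per document."""
    vocab = sorted({term for doc in corpus for term in doc})
    n_docs = len(corpus)
    idf = {
        term: math.log(n_docs / sum(1 for doc in corpus if term in doc))
        for term in vocab
    }
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        vectors.append([counts[term] / len(doc) * idf[term] for term in vocab])
    return vocab, idf, vectors


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, corpus):
    """Return document indices ordered from most to least similar to the query."""
    vocab, idf, vectors = build_vectors(corpus)
    counts = Counter(query_tokens)
    # Query terms outside the vocabulary are ignored.
    query_vec = [counts[term] / len(query_tokens) * idf[term] for term in vocab]
    sims = [cosine(query_vec, vec) for vec in vectors]
    return sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)


# Hypothetical corpus and query, tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang in the tree".split(),
]

print(rank("dog cat".split(), corpus))  # [1, 0, 2]
```

The second document ranks first because it contains both query terms, the first document matches only "cat", and the third shares no weighted term with the query. The same vectors could be passed to a clustering or classification algorithm instead of being compared with a query.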