# What is TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. It’s a numerical statistic that indicates how important a word is to a document relative to a collection of documents, often called a corpus. TF-IDF is commonly used in information retrieval and text mining.

Here’s a breakdown:

1. Term Frequency (TF):
• This represents the frequency of a term in a particular document.
• \( \text{TF}(t, d) \) is the number of times term \( t \) appears in document \( d \), divided by the total number of terms in document \( d \).
• Formula:
\[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
2. Inverse Document Frequency (IDF):
• This measures how rare, and therefore how informative, the term is across the entire corpus.
• The idea is that terms that appear in many documents are less informative than those that appear in only a few.
• Formula:
\[ \text{IDF}(t, D) = \log \frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t} \]
3. TF-IDF Score:
• It’s the product of TF and IDF.
• Formula:
\[ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \]

This score will be high for a term that appears often in a particular document, but not in many documents in the corpus, implying the term is important for that specific document.
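
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `tf`, `idf`, and `tf_idf`, the whitespace tokenization, and the use of the natural logarithm are all assumptions (the formulas do not fix a log base), and the toy corpus is made up.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency: log(N / number of documents containing `term`).

    Assumes the term occurs in at least one document; otherwise the
    denominator would be zero.
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF score: the product of TF and IDF."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus, tokenized by lowercasing and splitting on whitespace.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
corpus_tokens = [doc.lower().split() for doc in corpus]

# "cat" appears in 2 of 3 documents; "the" appears in all 3.
score_cat = tf_idf("cat", corpus_tokens[0], corpus_tokens)
score_the = tf_idf("the", corpus_tokens[0], corpus_tokens)
```

In this corpus, "the" occurs in every document, so its IDF is log(3/3) = 0 and its TF-IDF score is zero even though it is the most frequent word in the first document. "cat" gets a positive score because it is missing from one document.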

### Why is it useful?

• Discrimination between words: Common words (like “and”, “the”, “is”) appear in nearly every document, so their IDF is close to zero and their overall TF-IDF score is low. This down-weights these frequent, less informative terms.
• Highlighting unique terms: If a term appears often in a particular document but not in many other documents, it’ll have a high TF-IDF score, which is helpful for tasks like document summarization and keyword extraction.
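
Both effects can be seen on a small made-up corpus. The sketch below scores every term in one document and returns the highest-scoring ones as candidate keywords. The helper `top_keywords`, the whitespace tokenization, and the natural logarithm are assumptions chosen for illustration.

```python
import math
from collections import Counter

def top_keywords(doc_index, corpus_tokens, k=3):
    """Return the k highest-scoring (term, tf-idf) pairs for one document."""
    n_docs = len(corpus_tokens)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    doc = corpus_tokens[doc_index]
    counts = Counter(doc)
    scores = {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }
    # Sort by score (descending), then alphabetically so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical corpus: "the", "is", and "and" occur in every document.
corpus = [
    "the rocket launch is delayed and the rocket is grounded",
    "the election is close and the vote is tomorrow",
    "the recipe is simple and the soup is warm",
]
corpus_tokens = [doc.split() for doc in corpus]

keywords = top_keywords(0, corpus_tokens, k=1)
all_scores = dict(top_keywords(0, corpus_tokens, k=100))
```

Here `keywords` holds "rocket" as the top term for the first document, while `all_scores` gives "the", "is", and "and" a score of exactly zero because each occurs in all three documents.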

### Applications

TF-IDF is widely used in various applications:

• Information Retrieval: For example, in search engines to rank documents based on a query.
• Text Mining: For extracting keywords or significant terms from a document.
• Document Clustering and Classification: To represent documents as vectors in a high-dimensional space for machine learning tasks.
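
To make the retrieval and vector-representation ideas concrete, here is a minimal sketch that turns each document and a query into a sparse TF-IDF vector and ranks documents by cosine similarity. Cosine similarity is one common choice, used here as an assumption; it is not a description of how any particular search engine works. The helpers `build_idf`, `tfidf_vector`, `cosine`, and `rank` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def build_idf(corpus_tokens):
    """Map each term in the corpus to log(N / document frequency)."""
    n_docs = len(corpus_tokens)
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are skipped."""
    counts = Counter(t for t in tokens if t in idf)
    total = len(tokens)
    return {term: (count / total) * idf[term] for term, count in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, corpus):
    """Return document indices ordered from most to least similar to the query."""
    corpus_tokens = [doc.lower().split() for doc in corpus]
    idf = build_idf(corpus_tokens)
    doc_vectors = [tfidf_vector(tokens, idf) for tokens in corpus_tokens]
    query_vector = tfidf_vector(query.lower().split(), idf)
    scored = [(cosine(query_vector, vec), i) for i, vec in enumerate(doc_vectors)]
    # Highest similarity first; ties broken by original document order.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

corpus = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the stock market fell sharply",
]
ranking = rank("stock market", corpus)
```

For the query "stock market", the third document ranks first because it is the only one that shares any term with the query. The same dictionary-based vectors could also be passed to a clustering or classification routine.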