See Ledolter (2013).

tf-idf (term frequency/inverse document frequency) score

\[\text{tf-idf} = f_{ij} \times \log\left(\frac{n}{d_j}\right)\]

Consider a document containing 10,000 words wherein the word donkey appears 300 times. Following the earlier definition, the term frequency (tf) for donkey is 300/10,000 = 0.03. Now, assume we have 1,000 documents and donkey appears in 10 of these. Then, its inverse document frequency (idf) is calculated as log(1000/10) = 2. The tf-idf score is the product of these quantities: 0.03 × 2 = 0.06.

Suppose a document contains 10,000 words, of which the word donkey appears 300 times. Then

\[\text{tf} = 300/10000 = 0.03\] Now suppose there are 1,000 documents and 10 of them contain donkey. Then

\[\text{idf} = \log(1000/10) = 2\]

The resulting score is

\[\text{tf-idf} = 0.03 \times 2 = 0.06\]
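The calculation above can be sketched in a few lines of Python. The function name `tf_idf` and its parameter names are my own choices, and I use a base-10 logarithm so that log(1000/10) = 2 as in the example:

```python
import math

def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """tf-idf = (term frequency within the document) x log10(n / d_j)."""
    tf = term_count / doc_length                # 300 / 10,000 = 0.03
    idf = math.log10(n_docs / docs_with_term)   # log10(1000 / 10) = 2
    return tf * idf

# The donkey example from the text:
score = tf_idf(term_count=300, doc_length=10_000, n_docs=1_000, docs_with_term=10)
print(score)  # 0.06
```

Note that other conventions exist (natural log, smoothed idf such as log(1 + n/d_j)); the plain ratio form here follows the formula given above.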

Several other preprocessing steps can be used to get a meaningful list of words and their counts (frequencies). Words can be single words or bigrams of words. Bigrams are groups of two adjacent words, and such bigrams are commonly used as the basis for the statistical analysis of text. Bigrams can be extended to trigrams (three adjacent words) and, more generally, n-grams, which are sequences of n adjacent words.

We also see a limitation of this calculation: it ignores word combinations, which is why n-grams were introduced to take such phrases into account.
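Extracting n-grams from a token sequence is a simple sliding window; a minimal sketch (the helper `ngrams` is my own, not from Ledolter):

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))  # bigrams: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 3))  # trigrams: [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

Counting the resulting tuples instead of single words then lets the same tf-idf machinery score phrases as well as individual terms.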

Ledolter, Johannes. 2013. Data Mining and Business Analytics with R. 1st ed. Wiley.