public class TfIdf extends Object
TF(document, word) is the term frequency: the number of occurrences of a given word in a given document. TF is expected to correlate with the relevance of the word to the document.
DF(word) is the document frequency of a word: the number of documents a given word occurs in.
IDF(word) is the inverse document frequency of a word: log(D/DF), where D is the overall number of documents.
IDF is expected to correlate with the salience of the word: a high value
means the word is highly specific to the documents it occurs in. For example,
words like "in" and "the" have an IDF of zero when they occur in every
document, because then DF = D and log(D/DF) = log(1) = 0.
TF-IDF(document, word) is the product TF * IDF for a given word in a given document.
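The definitions above can be sketched in a few lines of Java. This is a minimal illustration, not the code of this class: the class name `TfIdfSketch`, its method names, the representation of a document as a list of lowercase tokens, and the use of the natural logarithm are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, written only to illustrate the definitions.
final class TfIdfSketch {

    /** TF: the number of occurrences of the word in the document. */
    static long tf(List<String> document, String word) {
        return document.stream().filter(word::equals).count();
    }

    /** DF: the number of documents that contain the word at least once. */
    static long df(List<List<String>> corpus, String word) {
        return corpus.stream().filter(doc -> doc.contains(word)).count();
    }

    /** IDF: log(D / DF). Natural log is an assumption; the base only rescales scores. */
    static double idf(List<List<String>> corpus, String word) {
        long df = df(corpus, word);
        if (df == 0) {
            throw new IllegalArgumentException("word does not occur in the corpus: " + word);
        }
        return Math.log((double) corpus.size() / df);
    }

    /** TF-IDF: the product TF * IDF for the word in the document. */
    static double tfIdf(List<List<String>> corpus, List<String> document, String word) {
        return tf(document, word) * idf(corpus, word);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat", "on", "the", "mat"),
                Arrays.asList("the", "dog", "chased", "the", "cat"),
                Arrays.asList("the", "bird", "sang"));
        List<String> first = corpus.get(0);
        // "the" occurs in every document, so its IDF (and TF-IDF) is zero.
        System.out.println("TF-IDF(first, the) = " + tfIdf(corpus, first, "the"));
        // "cat" occurs in two of three documents, so its IDF is log(3/2).
        System.out.println("TF-IDF(first, cat) = " + tfIdf(corpus, first, "cat"));
    }
}
```

In this toy corpus the word "the" has a TF of 2 in the first document but still scores zero, which shows how IDF cancels out words that occur everywhere.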
When you enter a search phrase, the program first removes the stopwords, then looks up each remaining search term in the inverted index, which yields a set of documents for each search term. It intersects these sets, keeping only the documents that contain all the search terms. Each combination of document and search term has an associated TF-IDF score. The program sums these scores per document to get each document's total score. Finally, it sorts the documents by score (descending) and presents them to the user as the search result.
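The search procedure described above might look roughly like the following sketch. It is not the implementation behind this class: the class name `SearchSketch`, the in-memory index layout (term to document ID to TF-IDF score), the tiny stopword set, and whitespace tokenization are all assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class illustrating the search steps; all names are assumptions.
final class SearchSketch {
    // Illustrative stopword set, far smaller than a real one.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "in", "on"));

    /** Builds an inverted index: term -> (document ID -> TF-IDF score). */
    static Map<String, Map<Integer, Double>> buildIndex(List<List<String>> docs) {
        Map<String, Map<Integer, Double>> index = new HashMap<>();
        // First pass: count TF per (term, document), skipping stopwords.
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id)) {
                if (STOPWORDS.contains(word)) continue;
                index.computeIfAbsent(word, k -> new HashMap<>()).merge(id, 1.0, Double::sum);
            }
        }
        // Second pass: multiply each TF by IDF = log(D / DF); DF is the posting count.
        for (Map<Integer, Double> postings : index.values()) {
            double idf = Math.log((double) docs.size() / postings.size());
            postings.replaceAll((id, tf) -> tf * idf);
        }
        return index;
    }

    /** Returns IDs of documents containing all search terms, best score first. */
    static List<Integer> search(Map<String, Map<Integer, Double>> index, String phrase) {
        Set<Integer> hits = null;                       // running intersection
        Map<Integer, Double> totals = new HashMap<>();  // summed TF-IDF per document
        for (String term : phrase.toLowerCase().split("\\s+")) {
            if (term.isEmpty() || STOPWORDS.contains(term)) continue;
            Map<Integer, Double> postings = index.getOrDefault(term, Collections.emptyMap());
            if (hits == null) {
                hits = new HashSet<>(postings.keySet());
            } else {
                hits.retainAll(postings.keySet());
            }
            postings.forEach((id, score) -> totals.merge(id, score, Double::sum));
        }
        if (hits == null) return Collections.emptyList(); // only stopwords were entered
        List<Integer> result = new ArrayList<>(hits);
        result.sort(Comparator.comparingDouble((Integer id) -> totals.get(id)).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat", "sat", "on", "the", "mat"),
                Arrays.asList("the", "cat", "chased", "the", "cat"),
                Arrays.asList("the", "dog", "sat"));
        Map<String, Map<Integer, Double>> index = buildIndex(docs);
        System.out.println(search(index, "the cat")); // prints [1, 0]
        System.out.println(search(index, "cat sat")); // prints [0]
    }
}
```

Document 1 ranks above document 0 for "the cat" because "cat" occurs twice in it, doubling its TF-IDF score. The order of documents with equal scores is unspecified in this sketch.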
Constructor and Description |
---|
TfIdf() |
public static void main(String[] args)
Copyright © 2020 Hazelcast, Inc. All rights reserved.