Hi Andrejs,
The calculations are a bit different to what I've come across in Mining of Massive
Datasets (2nd Ed., Ullman et al., Cambridge University Press), available here: http://www.mmds.org/
Their calculation of IDF is as follows:
IDFi = log2(N / ni)
where N is the number of documents and ni is the number of documents in which the word appears.
This looks different to your IDF function.
For TF, they use
TFij = fij / maxk fkj
That is:
For document j, the term frequency of the term i in j is the number of times i appears
in j divided by the maximum number of times any term appears in j. (Stop words are usually
excluded when considering the maximum.)
So, in your case, the term frequencies for the first document are:
TFa1 = 2 / 2 = 1
TFb1 = 1 / 2 = 0.5
TFc1 = 1 / 2 = 0.5
TFm1 = 2 / 2 = 1
...
IDFa = log2(3 / 2) = 0.585
So, TFa1 * IDFa = 0.585
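The calculation above can be reproduced with a few lines of plain Scala (no Spark involved). This is only a sketch: the object and function names are made up for illustration, and the three documents are copied from the original question quoted below.

```scala
object MmdsTfIdf {
  // The three documents from the original question, one string per document.
  val docs: Seq[Seq[String]] = Seq(
    "a a b c m m",
    "e a c d e e",
    "d j k l m m c"
  ).map(_.split(" ").toSeq)

  // TF_ij = f_ij / max_k f_kj: the raw count of the term divided by the
  // count of the most frequent term in the same document.
  def tf(term: String, doc: Seq[String]): Double = {
    val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size }
    counts.getOrElse(term, 0).toDouble / counts.values.max
  }

  // IDF_i = log2(N / n_i), where n_i is the number of documents containing
  // the term. A term that occurs in no document gives Infinity here.
  def idf(term: String, corpus: Seq[Seq[String]]): Double = {
    val n = corpus.count(_.contains(term))
    math.log(corpus.size.toDouble / n) / math.log(2)
  }

  def main(args: Array[String]): Unit = {
    val doc1 = docs.head
    for (t <- doc1.distinct) {
      val w = tf(t, doc1) * idf(t, docs)
      println(f"$t: tf=${tf(t, doc1)}%.3f idf=${idf(t, docs)}%.3f tfidf=$w%.3f")
    }
  }
}
```

For the term a in the first document this gives TF = 1 and IDF = log2(3/2), which is about 0.585, the same as the hand calculation.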
Wikipedia mentions an adjustment to overcome biases for long documents, by calculating TFij
= 0.5 + {(0.5*fij)/maxk fkj}, but that doesn't change anything for TFa1, as the value remains
1.
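The adjusted formula is easy to check for the first document. A minimal sketch, with the per-term counts for "a a b c m m" written out by hand and illustrative names:

```scala
object AugmentedTf {
  // Adjusted TF: 0.5 + 0.5 * f_ij / max_k f_kj
  def augmentedTf(count: Int, maxCount: Int): Double =
    0.5 + 0.5 * count / maxCount

  def main(args: Array[String]): Unit = {
    // Document 1 is "a a b c m m": a and m occur twice, b and c once.
    val counts = Map("a" -> 2, "b" -> 1, "c" -> 1, "m" -> 2)
    val maxCount = counts.values.max
    counts.foreach { case (term, c) =>
      println(s"$term: ${augmentedTf(c, maxCount)}")
    }
  }
}
```

Terms that already have the maximum count (a and m) stay at 1, while b and c move from 0.5 to 0.75.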
In other words, my calculations don't agree with yours, and neither seems to agree with Spark
:)
Regards,
Ashic.
Date: Thu, 30 Oct 2014 22:13:49 +0000
Subject: how idf is calculated
From: andrejs@sindicetech.com
To: user@spark.incubator.apache.org
Hi,
I'm writing a paper and I need to calculate tfidf. With your help I managed to get the results
I needed, but the problem is that I need to be able to explain how each number was obtained.
So I tried to understand how idf is calculated, and the numbers I get don't correspond to
those I should get.
I have 3 documents (each line is a document):
a a b c m m
e a c d e e
d j k l m m c
When I calculate tf, I get this:
(1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
(1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
(1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])
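The indices in these vectors (97 to 109) look like the character codes of the single-letter terms, which would be consistent with each term's hash code being reduced modulo the vector size 1048576 (2^20). The sketch below assumes exactly that mapping. It is an illustration in plain Scala with made-up names, not Spark's actual code.

```scala
object TermIndex {
  // Vector size shown in the output above (2^20).
  val numFeatures: Int = 1048576

  // Assumption: index = term.hashCode mod numFeatures, kept non-negative.
  def indexOf(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  def main(args: Array[String]): Unit =
    "a b c d e j k l m".split(" ").foreach { t =>
      println(s"$t -> ${indexOf(t)}")
    }
}
```

Under this assumption a maps to 97, c to 99 and m to 109, so the vector with indices [97,98,99,109] would be the document "a a b c m m".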
idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where
m is the number of documents (3 in my case) and
d(t) is the number of documents in which the term is present:
a: log(4/3) = 0.1249387366
b: log(4/2) = 0.3010299957
c: log(4/4) = 0
d: log(4/3) = 0.1249387366
e: log(4/2) = 0.3010299957
l: log(4/2) = 0.3010299957
m: log(4/3) = 0.1249387366
When I output the idf vector with `idf.idf.toArray.filter(_.>(0)).distinct.foreach(println(_))`, I get:
1.3862943611198906
0.28768207245178085
0.6931471805599453
I understand why there are only 3 numbers, because only 3 are unique: log(4/2), log(4/3),
log(4/4), but I don't understand how the numbers in idf were calculated.
Best regards,
Andrejs
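One possible reading of the numbers, which the thread itself does not confirm: the hand-calculated values are what log((m + 1) / (d(t) + 1)) gives with a base-10 logarithm, while the three printed values are what the same formula gives with the natural logarithm. The sketch below evaluates both for m = 3. The names are illustrative, and d(t) = 0 is included on the assumption that vector slots for terms appearing in no document are also printed.

```scala
object IdfBases {
  // idf = log((m + 1) / (d(t) + 1)) using the natural logarithm.
  def idfNatural(m: Int, df: Int): Double =
    math.log((m + 1).toDouble / (df + 1))

  // The same ratio using a base-10 logarithm.
  def idfBase10(m: Int, df: Int): Double =
    math.log10((m + 1).toDouble / (df + 1))

  def main(args: Array[String]): Unit = {
    val m = 3 // number of documents
    for (df <- 0 to 3)
      println(s"d(t)=$df  ln=${idfNatural(m, df)}  log10=${idfBase10(m, df)}")
  }
}
```

With the natural logarithm, d(t) = 2 gives about 0.2877 and d(t) = 1 gives about 0.6931, matching two of the printed values. The third, about 1.3863, equals ln(4/1), which would correspond to d(t) = 0. The case d(t) = 3 gives 0 and is removed by the filter.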
