Talk:Jaccard index

From Wikipedia, the free encyclopedia

Why does cosine similarity redirect here? It needs a separate page and the redirect should be removed.

Please someone remove capital sigma and find some other letter to qualify for and to replace NOT pixel, — Preceding unsigned comment added by 5.12.231.73 (talk) 03:46, 15 March 2024 (UTC)
read that "sigma with no sub" and "stand for pixels"
5.12.231.73 (talk) 03:49, 15 March 2024 (UTC)

Tanimoto coefficient

Dear me, I've finally tracked down a paper actually by Taffee Tanimoto, from 1960, about similarity and distance....

If we assume these are the correct "Tanimoto" functions, then similarity is the same as Jaccard (but expressed over bit vectors) and distance is

which I haven't seen anywhere in the more modern literature. This is not a distance function, and this is a deliberate choice!

So it would appear that the reading of the vector function as a specialising multiset function is just wrong, but a neat idea... seems to be my invention, not Tanimoto's!

Why the hell did anyone ever express this in terms of vector arithmetic when it's a bitmap operator?????


_____ —Preceding unsigned comment added by RichardThePict (talk · contribs) 22:59, 20 May 2011 (UTC)


I am proposing to replace this entire section with the text below, if nobody objects:

___


The Tanimoto Similarity Coefficient is a generalisation of Jaccard which allows for a similarity to be calculated over multisets.

To calculate Tanimoto Similarity, the multisets must first be viewed as vectors. In some arbitrary order, each element of each set is considered as a vector dimension, and the cardinality of the element within that set is used as the magnitude of that dimension in its representative vector. For example, the multisets {apple, apple, pear} and {banana, pear} may be viewed as the vectors (2,0,1) and (0,1,1) respectively if the elements are considered in alphabetical order.

Cosine Similarity already gives a similarity coefficient over vectors, bounded in [0,1] when all dimensions are positive or zero. However, the Cosine Similarity of the simple sets {apple, pear} and {banana, pear} yields one half, whereas the Jaccard Coefficient of these sets is one third.
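The two paragraphs above can be checked numerically. This is a minimal sketch, not part of the proposed article text; the helper names `to_vectors`, `cosine` and `jaccard` are my own.

```python
from collections import Counter
from math import sqrt

def to_vectors(a, b):
    """Map two multisets onto count vectors, with elements in alphabetical order."""
    ca, cb = Counter(a), Counter(b)
    keys = sorted(set(ca) | set(cb))
    return [ca[k] for k in keys], [cb[k] for k in keys]

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def jaccard(a, b):
    """Jaccard coefficient of two non-empty simple sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# The multiset example from the text.
u, v = to_vectors(["apple", "apple", "pear"], ["banana", "pear"])
print(u, v)  # [2, 0, 1] [0, 1, 1]

# The simple-set example: cosine is one half (up to rounding), Jaccard one third.
s, t = to_vectors(["apple", "pear"], ["banana", "pear"])
print(cosine(s, t))
print(jaccard({"apple", "pear"}, {"banana", "pear"}))
```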

The purpose of the Tanimoto Coefficient is thus dual: to give a similarity coefficient over multisets, but also to specialise to the same value as the Jaccard Coefficient when only simple sets are considered. The equation identified by Tanimoto (also, anecdotally, by Jaccard 50 years earlier) is:

$$T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}$$

Notice that the product and magnitude operators are vector, not set, operators. When two sets (not multisets) are viewed as vectors, the dot product of those vectors is equal to the cardinality of their intersection, and the square of each vector magnitude is the cardinality of the set itself. Thus, the formula immediately specialises to the Jaccard Coefficient for simple sets. Tanimoto is bounded within [0,1] if all vector dimensions are non-negative.
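To illustrate the specialisation described above, here is a small sketch. It assumes the vector form is the dot product divided by (squared magnitude of A + squared magnitude of B - dot product), as the paragraph describes; the function name `tanimoto` is mine.

```python
from fractions import Fraction

def tanimoto(u, v):
    """Dot product over (|u|^2 + |v|^2 - dot product), as an exact fraction.

    Assumes integer components; raises ZeroDivisionError if both vectors are zero.
    """
    dot = sum(x * y for x, y in zip(u, v))
    return Fraction(dot, sum(x * x for x in u) + sum(y * y for y in v) - dot)

# Simple sets {apple, pear} and {banana, pear} as 0/1 vectors:
# the result equals the Jaccard coefficient, 1/3.
print(tanimoto([1, 0, 1], [0, 1, 1]))  # 1/3

# Multisets {apple, apple, pear} and {banana, pear}.
print(tanimoto([2, 0, 1], [0, 1, 1]))  # 1/6
```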

Tanimoto Distance (1 - T(A,B)) is often quoted as being a proper distance metric. This is not true in general; it holds only if, in each dimension, every vector takes either zero or one fixed constant value. Tanimoto is therefore a proper distance metric for use over weighted sets modelled as vectors, but not over multisets in general.
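A concrete counterexample may make the claim about multisets easier to verify. The sketch below assumes the dot-product form of T described above; the three one-dimensional vectors (a single element with multiplicity 1, 2 and 4) are my own choice and do not come from any of the cited sources.

```python
from fractions import Fraction

def tanimoto_distance(u, v):
    """1 - T(u, v), with T the dot-product form, computed exactly."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1 - Fraction(dot, sum(x * x for x in u) + sum(y * y for y in v) - dot)

# Multisets holding one element with multiplicity 1, 2 and 4.
a, b, c = [1], [2], [4]

d_ab = tanimoto_distance(a, b)  # 1/3
d_bc = tanimoto_distance(b, c)  # 1/3
d_ac = tanimoto_distance(a, c)  # 9/13

# The triangle inequality would need d_ac <= d_ab + d_bc, but 9/13 > 2/3.
print(d_ac, d_ab + d_bc, d_ac <= d_ab + d_bc)  # 9/13 2/3 False
```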

Tanimoto is extensively used in chemistry, to give a similarity metric for molecules. Most sources use the term inaccurately, as a synonym for Jaccard.

RichardThePict (talk) 14:22, 9 May 2011 (UTC)


Other sources give the Tanimoto coefficient as

So without the squares. This also makes more sense when compared to the Jaccard distance.

Some other sources are:

What is the correct definition?

Twanvl (talk) 17:35, 11 November 2008 (UTC)


Another source gives this, as in practical use for molecule similarity, as

[1]

which is superficially the same but doesn't appear to use vector arithmetic, so the magnitude function is set cardinality rather than vector magnitude as defined in this article. The scalar product is of course the same as the cardinality of the intersection unless we're talking about multisets. Are we, by the way, talking about multisets? If not, why does the scalar product feature in the article?

Of course this could just be in error, if someone has transcribed a formula without looking at the meaning of the syntax...

Then, in

[2]

we have :

Tanimoto distance. The distance between two sets is computed in the following way:

             n1 + n2 - 2*n12
D(S1, S2) = ------------------
             n1 + n2 - n12

, where n1 and n2 are the numbers of elements in sets S1 and S2, respectively, and n12 is the number that is in both sets.
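For comparison, the quoted set formula can be evaluated directly. This is a sketch with a function name of my own choosing; it assumes at least one of the sets is non-empty. For simple sets it gives 1 minus the Jaccard coefficient.

```python
from fractions import Fraction

def tanimoto_set_distance(s1, s2):
    """(n1 + n2 - 2*n12) / (n1 + n2 - n12) for two simple sets."""
    n1, n2, n12 = len(s1), len(s2), len(s1 & s2)
    return Fraction(n1 + n2 - 2 * n12, n1 + n2 - n12)

s1, s2 = {"apple", "pear"}, {"banana", "pear"}
jaccard = Fraction(len(s1 & s2), len(s1 | s2))

print(tanimoto_set_distance(s1, s2))  # 2/3
print(1 - jaccard)                    # 2/3
```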


Tanimoto apparently defined something in 1947 in an IBM Technical Journal which is hard or impossible to access. However others have related it to a digression by Jaccard and published by him in or around 1903, and it has been known as the Jaccard-Tanimoto Index, I believe. Sorry, I don't have attribution for any of this info, but would love to come to a conclusion about how this function actually is defined! All the citations/references I've found in the research literature seem to just copy each other, sometimes adding mistakes....

130.159.185.103 (talk) 11:50, 6 May 2011 (UTC) RC

It's interesting that there is so much rubbish talked about this coefficient.... hopefully this is not because there's a bad Wikipedia entry on the subject!

The point of Tanimoto seems to be that it is a generalisation of Jaccard, i.e. it's a vector metric (that can therefore be used for multiset similarity with a simple isomorphism) which specialises to Jaccard when the multiset cardinalities don't exceed 1. This section really needs to be completely rewritten to reflect this - I might do this in a while if nobody proposes anything else.

Also, the bit about the [-1,1] range is garbage; those values are all positive (I believe) and so the function range is [0,1].

RichardThePict (talk) 13:08, 9 May 2011 (UTC)

References

Divide by zero for empty sets

What happens when both sets are empty sets? The metric gives a divide-by-zero. Should the metric only work for non-empty sets or is there a specific value (0.0 or 1.0) that should be returned in this case? pgr94 (talk) 07:27, 22 December 2010 (UTC)

Everywhere I look, Jaccard is for non-empty sets. Unfortunately, I don't have a dead tree reference that mentions it. Charles Merriam (talk) 17:54, 4 July 2013 (UTC)
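One way an implementation might sidestep the question is to make the empty-set value an explicit choice. The sketch below is only an illustration: the `empty_value` parameter and its default of 1.0 are assumptions of mine, not a convention established by any source in this thread.

```python
def jaccard(a, b, empty_value=1.0):
    """Jaccard coefficient; returns empty_value when both sets are empty.

    empty_value is a hypothetical parameter: the caller must decide whether
    two empty sets count as identical (1.0), dissimilar (0.0), or an error.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # The formula would be 0/0 here, so fall back to the caller's choice.
        return empty_value
    return len(a & b) / len(union)

print(jaccard(set(), set()))                   # 1.0
print(jaccard(set(), set(), empty_value=0.0))  # 0.0
print(jaccard({"pear"}, set()))                # 0.0
```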

Mountford's index of similarity

"See also" points to "Mountford's index of similarity" ... this is a dead link — Preceding unsigned comment added by 67.81.4.49 (talk) 11:24, 15 October 2011 (UTC)

Jaccard Similarity Calculation -- mistake?

In the section "Similarity of asymmetric binary attributes", the article states:

The Jaccard similarity coefficient, J, is given as

$$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$$

This would give a result of 0 where A and B both have a value of 0 (and are, therefore, similar). Shouldn't the correct equation be the following?:

— Preceding unsigned comment added by AmanAhuja (talk · contribs) 02:53, 15 March 2012 (UTC)

Update: never mind, I was conflating the Jaccardian with the Hamming Distance. AmanAhuja (talk) 03:23, 15 March 2012 (UTC)

Tversky coefficient

I think the whole equivalence classification needs even more cleanup. The entry covering Tversky_index speaks of it as a generalization of Tanimoto and Dice. The comment above speaks of Tanimoto being a generalization of Jaccard, yet the same author says in Talk:Tversky_index that this actually is NOT the case. I must say this is quite confusing. LinguistManiac (talk) 06:56, 22 May 2012 (UTC)

Clarification

A diagram would help introduce this concept to beginners, and perhaps a worked example in the context of e.g. NLP evaluation. Leondz (talk) 11:37, 18 March 2014 (UTC)

Set of sets

The comment being removed in the first section claims that the measure is a metric on sets of sets and gives two citations. Neither citation mentions sets of sets, or formulates the metric in a way that would be applicable to a set of sets. — Preceding unsigned comment added by Amoss (talk · contribs) 10:33, 1 October 2014 (UTC)

Redundant comparison on two pages

The Jaccard index and Simple_matching_coefficient pages each contain a comparison with the other. It's probably better to keep the comparison on one page and link to it from the other.

--Qria (talk) 09:44, 13 November 2016 (UTC)

Similarity of asymmetric binary attributes -- mistake?

Under 'Similarity of asymmetric binary attributes' it says:

M11 + M01 + M10 + M00 = 2*n (n is number of binary features of A and B)
But mustn't it be: M11 + M01 + M10 + M00 = n ?
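A quick count supports the correction suggested above: each attribute position falls into exactly one of the four categories, so the four totals add up to n. The sketch below uses made-up attribute vectors and a helper name of my own.

```python
def match_counts(a, b):
    """Return (M11, M10, M01, M00) for two equal-length lists of 0/1 attributes."""
    if len(a) != len(b):
        raise ValueError("both objects need the same number of attributes")
    pairs = list(zip(a, b))
    # Each position contributes to exactly one of the four counts.
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))

a = [1, 1, 0, 0, 1, 0]  # illustrative data, n = 6
b = [1, 0, 0, 1, 1, 0]

m11, m10, m01, m00 = match_counts(a, b)
print(m11, m10, m01, m00)               # 2 1 1 2
print(m11 + m10 + m01 + m00 == len(a))  # True
```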

Difference with the Simple matching coefficient (SMC)

Article states:

"Using the SMC would then induce a bias by systematically considering, as more similar, two customers with large identical baskets compared to two customers with identical but smaller baskets, thus making the Jaccard index a better measure of similarity in that context."

However, customers with identical baskets will have a similarity of 1 regardless of the size of their basket for both Jaccard and SMC measures. 89.101.126.62 (talk) 14:14, 18 February 2017 (UTC)

I absolutely agree, thx for bringing it up! This section had a few mistakes actually. I rewrote the paragraph, hopefully that should fix this :) 7804j (talk) 20:40, 4 March 2017 (UTC)
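The objection raised in this thread is easy to confirm with a small sketch. The basket contents and the catalogue size of 1000 are invented for illustration, and the SMC helper assumes every basket is drawn from that fixed catalogue.

```python
def jaccard(a, b):
    """Jaccard index of two non-empty baskets (sets of items)."""
    return len(a & b) / len(a | b)

def smc(a, b, catalogue_size):
    """Simple matching coefficient over a catalogue of catalogue_size items."""
    m11 = len(a & b)                   # items in both baskets
    m00 = catalogue_size - len(a | b)  # items in neither basket
    return (m11 + m00) / catalogue_size

small = {"bread", "milk"}
large = {"bread", "milk", "eggs", "tea", "rice"}
catalogue_size = 1000  # hypothetical number of products on sale

# Identical baskets score 1.0 under both measures, whatever their size.
print(jaccard(small, small), smc(small, small, catalogue_size))  # 1.0 1.0
print(jaccard(large, large), smc(large, large, catalogue_size))  # 1.0 1.0
```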

Ruzicka similarity

I tried to track down details on "also known then as Ruzicka similarity" which is currently noted with a "citation needed". The comment was added in 2018 with the message "added synonyms of the generalized case".

Google Scholar finds fewer than 200 matches to "Ruzicka similarity", nearly all within the last 10 years, and most without any citation. I cannot help but suspect they looked at this page, saw the name, and decided to use it.

There are a few papers which cite: S.-H. Cha. Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4):300–307, 2007.

This in turn cites: Morisita M. Measuring of interspecific association and similarity between communities. Mem. Fac. Sci. Kyushu Univ. Ser. E (Biol.) 3:65-80, 1959. The DeepL translation of the Japanese text shows a more complex expression, and makes no mention of Ruzicka.

Instead, the proper citation seems to be: M. Ruzicka, Anwendung mathematisch-statistischer Methoden in der Geobotanik (synthetische Bearbeitung von Aufnahmen), Biologia, Bratislava, 13 (1958), pp. 647-661.

I cannot find a copy of this paper, but I did find two papers which describe the method, plus a third using the same sources as the first:

  1. Balconi et al., Cognitive distance in research collaborations (2013), uses 100 x ∑ min(xi1, xi2) / ∑ max(xi1, xi2) as "Ružička's index of similarity", citing Pielou, "The Interpretation of Ecological Data" (1984).
  2. St. Clair et al., Rapid stabilization of fire-disturbed sites using a soil crust slurry: inoculation studies (1984), which used "the similarity index first proposed by Ruzicka (1958), i.e. ∑minT / ∑maxC x 100".
  3. Sellman et al. Using fish to sample diatom composition in streams: are intestinal floras representative of natural substrates (2001) which says "Samples were clustered using Ruzicka's Similarity Index (Ruzicka 1958) and the unweighted group average clustering algorithm (Pielou 1984)".

This makes me think the Ruzicka Similarity Index is the generalized Jaccard Index x 100.

In any case, I can't make sense of the text "also known then." When is "then"? Known by whom? And who really was the first to generalize Jaccard Index? And why refer to it as "Ruzicka similarity" when we refer to this as the "Jaccard Index" even though the page says Gilbert developed it in 1884? 85.226.230.95 (talk) 22:21, 10 July 2025 (UTC)
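For readers following the min/max form quoted in the sources listed above, here is a minimal sketch of that calculation. The function name and sample vectors are mine, and whether the factor of 100 belongs to the definition is exactly what the comment above is asking, so it is left as a separate step.

```python
from fractions import Fraction

def min_max_similarity(x, y):
    """Sum of element-wise minima over sum of element-wise maxima.

    Assumes non-negative integer components and at least one non-zero entry.
    """
    return Fraction(sum(min(a, b) for a, b in zip(x, y)),
                    sum(max(a, b) for a, b in zip(x, y)))

# On 0/1 vectors this reduces to the ordinary Jaccard index.
print(min_max_similarity([1, 0, 1], [0, 1, 1]))  # 1/3

# With abundances (or multiset counts) it gives the generalized value.
s = min_max_similarity([2, 0, 1], [0, 1, 1])
print(s, float(100 * s))  # 1/4 25.0
```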

I have found an earlier publication of an alternate way to describe the same result as the min/max formulation of Ruzicka similarity.
Henry Allan Gleason, "Some Applications of the Quadrat Method", Bulletin of the Torrey Botanical Club, Vol. 47, No. 1 (Jan., 1920), pp. 21-33, starting on page 31, says: "Jaccard's method fails to take account of the much greater importance of some abundant species, and the resulting error of computation may be obviated, in part at least, by weighting each species with its frequency index."
He proposed a modified community coefficient computed by dividing the sum of the numbers "In first area only", "Common to both areas", and "In second area only" into "Common to both areas", with the example in Table II dividing 13.5 + 433.5 + 8.5 = 455.5 into 433.5 to get a community coefficient of 95 (since 433.5/455.5 = 0.9517..).
Gleason's work is mentioned elsewhere, like "Aims and methods of vegetation ecology", Mueller-Dombois and Ellenberg, p216 saying "GLEASON (1920) applied quantitative values directly to JACCARD’s formula without modification." to give Mc/(Ma+Mb+Mc) x 100, where "Mc is the sum of the percent biomass values of the species common to both stands, Ma is the sum of the biomass values of the species restricted to the first stand, and Mb is the corresponding sum for the species restricted to the second stand." (p215). 85.226.230.95 (talk) 18:45, 31 August 2025 (UTC)
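The arithmetic in the Table II example can be reproduced with the Mc/(Ma+Mb+Mc) x 100 form quoted above. This is only a sketch; the function and argument names are mine, and the three input figures are those quoted from Gleason.

```python
def community_coefficient(first_only, common, second_only):
    """Mc / (Ma + Mb + Mc) x 100, using the quantities named in the quotation."""
    return 100 * common / (first_only + common + second_only)

# Figures quoted from Gleason (1920), Table II.
value = community_coefficient(first_only=13.5, common=433.5, second_only=8.5)
print(round(value, 2))  # 95.17
print(round(value))     # 95
```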