Similarity Measures

Motivating Similarity Measures

  • Sometimes, we want to determine how similar two data entries are to each other
  • To do this, we need some way of measuring the similarity between those two entries
  • Equivalently, we can use dissimilarity or distance measures, which quantify how different two entries are
  • Simple matching coefficients (SMC) and Jaccard coefficients are calculated on binary data values
  • Euclidean distances and cosine distances are calculated on continuous data values

Simple Matching Coefficient

  • The simple matching coefficient (or SMC) is a measure of similarity between binary variables
  • The SMC calculates the percentage of mutual presences and mutual absences between the variable values
  • The simple matching coefficient is defined as the following:
SMC = \frac{M_{00}+M_{11}}{M_{00}+M_{01}+M_{10}+M_{11}}
  • Where M_{00} is the number of matching False values between two data entries
  • Where M_{11} is the number of matching True values between two data entries
  • Where M_{10} is the number of differing values where the first entry's value is True and the second entry's value is False
  • Where M_{01} is the number of differing values where the first entry's value is False and the second entry's value is True
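  • As a minimal sketch in Python (assuming two equal-length lists of booleans), the SMC can be computed as:

```python
def smc(a, b):
    """Simple matching coefficient between two equal-length boolean vectors."""
    m11 = sum(x and y for x, y in zip(a, b))              # matching True values
    m00 = sum((not x) and (not y) for x, y in zip(a, b))  # matching False values
    return (m00 + m11) / len(a)  # M00 + M01 + M10 + M11 equals the vector length
```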

Jaccard Coefficient

  • The Jaccard similarity coefficient is a measure of similarity between binary variables
  • The Jaccard coefficient calculates the percentage of mutual presences between the variable values
  • The Jaccard coefficient can be thought of as a variant of the SMC that ignores mutual absences
  • The Jaccard coefficient is defined as the following:
Jaccard = \frac{M_{11}}{M_{01}+M_{10}+M_{11}}
  • Where M_{11} is the number of matching True values between two data entries
  • Where M_{10} is the number of differing values where the first entry's value is True and the second entry's value is False
  • Where M_{01} is the number of differing values where the first entry's value is False and the second entry's value is True
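  • A minimal Python sketch, under the same boolean-vector assumption as the SMC sketch above:

```python
def jaccard(a, b):
    """Jaccard coefficient between two equal-length boolean vectors."""
    m11 = sum(x and y for x, y in zip(a, b))              # matching True values
    m00 = sum((not x) and (not y) for x, y in zip(a, b))  # matching False values
    return m11 / (len(a) - m00)  # denominator M01 + M10 + M11 = n - M00
```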

Dice Coefficient

  • The Dice coefficient is a measure of similarity between binary variables
  • The Dice coefficient calculates the percentage of mutual presences between the variable values (ignoring mutual absences)
  • The Dice coefficient is used in NLP, specifically when using a bag-of-words approach
  • The Dice coefficient can be expressed in terms of the Jaccard coefficient
  • Therefore, the Dice coefficient is very similar to the Jaccard coefficient and, like it, doesn't account for true negatives
  • The Dice coefficient is defined as the following:
Dice = \frac{2J}{1+J}
  • Where J is the Jaccard coefficient
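  • A minimal Python sketch, reusing the jaccard helper above:

```python
def dice(a, b):
    """Dice coefficient, computed from the Jaccard coefficient J as 2J / (1 + J)."""
    j = jaccard(a, b)  # reuses the jaccard sketch from above
    return 2 * j / (1 + j)
```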

Use-Cases for SMC and Jaccard Coefficients

  • The SMC coefficient is used for symmetrical binary variables, meaning the two classes associated with the binary variables are thought to have equal importance
  • On the other hand, the Jaccard coefficient is typically used for asymmetrical binary variables, meaning the two classes associated with the binary variables are thought to have different levels of importance
  • However, the Jaccard coefficient can also be used for symmetrical binary variables
  • It's important to note that we can transform multi-class categorical variables into many separate binary variables so that we can calculate SMC or Jaccard coefficients on the data entry (see the sketch after this list)
  • Since the SMC is used for symmetrical binary variables, the SMC is typically used for tasks like detecting cheating between two exams, where matching incorrect answers are as informative as matching correct ones
  • Since the Jaccard coefficient is used for asymmetrical binary variables, the Jaccard coefficient is typically used for tasks like detecting fraud or near-duplication between two documents
  • Therefore, the Jaccard coefficient is typically used when we have many False values (or TN values) that we don't want to account for in our similarity calculation
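  • As a sketch of the binary expansion mentioned above (the variable and class names below are hypothetical), a multi-class value can be expanded into one binary indicator per class:

```python
def one_hot(value, classes):
    """Expand one multi-class categorical value into one binary indicator per class."""
    return [value == c for c in classes]

# a hypothetical three-class variable expanded into three binary variables
print(one_hot("green", ["brown", "blue", "green"]))  # [False, False, True]
```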

Example of the SMC

id  Sex     25+  Cough  Fever  Chills  Headache  Sore Throat
01  Female  Yes  Yes    No     Yes     No        Yes
02  Male    Yes  Yes    No     Yes     Yes       No
  • For this example, we will exclude the id variable from our SMC calculation
  • The SMC is calculated as the following:
SMC = \frac{M_{00}+M_{11}}{M_{00}+M_{01}+M_{10}+M_{11}} = \frac{1+3}{1+1+2+3} = \frac{4}{7} \approx 0.57
  • Where M_{00} = 1 (from Fever)
  • Where M_{01} = 1 (from Headache)
  • Where M_{10} = 2 (from Sex and Sore Throat)
  • Where M_{11} = 3 (from 25+, Cough, and Chills)

Example of the Jaccard Coefficient

id  Sex     25+  Cough  Fever  Chills  Headache  Sore Throat
01  Female  Yes  Yes    No     Yes     No        Yes
02  Male    Yes  Yes    No     Yes     Yes       No
  • For this example, we will exclude the id variable from our Jaccard coefficient calculation
  • The Jaccard coefficient is calculated as the following:
Jaccard = \frac{M_{11}}{M_{01}+M_{10}+M_{11}} = \frac{3}{1+2+3} = \frac{3}{6} = 0.5
  • Where M_{01} = 1 (from Headache)
  • Where M_{10} = 2 (from Sex and Sore Throat)
  • Where M_{11} = 3 (from 25+, Cough, and Chills)

Example of the Dice Coefficient

id  Sex     25+  Cough  Fever  Chills  Headache  Sore Throat
01  Female  Yes  Yes    No     Yes     No        Yes
02  Male    Yes  Yes    No     Yes     Yes       No
  • For this example, we will exclude the id variable from our Dice coefficient calculation
  • The Dice coefficient is calculated as the following:
Dice = \frac{2M_{11}}{2M_{11}+M_{01}+M_{10}} = \frac{2 \times 3}{(2 \times 3) + 1 + 2} = \frac{6}{9} \approx 0.67
  • Where M_{01} = 1 (from Headache)
  • Where M_{10} = 2 (from Sex and Sore Throat)
  • Where M_{11} = 3 (from 25+, Cough, and Chills)
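  • Reusing the Python sketches from earlier (with Sex encoded as a match/mismatch indicator, Female = True, which is what the counts above assume), we can verify all three worked examples:

```python
# Rows 01 and 02 encoded as booleans:
# Sex(F), 25+, Cough, Fever, Chills, Headache, Sore Throat
p1 = [True, True, True, False, True, False, True]
p2 = [False, True, True, False, True, True, False]

print(smc(p1, p2))      # 0.571... (4/7)
print(jaccard(p1, p2))  # 0.5      (3/6)
print(dice(p1, p2))     # 0.666... (6/9)
```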

Euclidean Distance

  • The Euclidean distance is a measure of dissimilarity between continuous variables
  • The Euclidean distance calculates the ordinary straight-line distance between two points in Euclidean space
  • The Euclidean distance can also be referred to as the Pythagorean distance
  • The Euclidean distance is the distance induced by the L2 norm (another name for the Euclidean norm)
  • The Euclidean distance is defined as the following:
EUC = \sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^{2}}
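  • A minimal Python sketch using NumPy:

```python
import numpy as np

def euclidean(x, y):
    """Straight-line (L2) distance between two points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))  # equivalent to np.linalg.norm(x - y)
```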

Great-Circle Distance

  • The great-circle distance (or orthodromic distance) is a measure of dissimilarity between continuous variables
  • The great-circle formula calculates the shortest distance between two points on the surface of a sphere
  • The Earth is nearly spherical, so the great-circle distance provides an accurate distance between two points on the surface of the Earth within about 0.5%
  • The great-circle distance is defined as the following:
GC = r\,\delta\sigma
  • Where r is the radius of the sphere
  • Where \delta\sigma = \arccos(\sin(\lambda_{1})\sin(\lambda_{2}) + \cos(\lambda_{1})\cos(\lambda_{2})\cos(\omega_{2}-\omega_{1}))
  • Where \lambda_{i} is the latitude of point i
  • Where \omega_{i} is the longitude of point i
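  • A minimal Python sketch (coordinates in radians; 6371 km is assumed as a mean Earth radius):

```python
import math

def great_circle(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance between two (latitude, longitude) points in radians.

    r is the sphere's radius; the 6371 km default is an assumed mean Earth radius.
    """
    central_angle = math.acos(
        math.sin(lat1) * math.sin(lat2)
        + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1)
    )
    return r * central_angle
```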

Haversine Distance

  • The haversine distance is a measure of dissimilarity between continuous variables
  • The haversine formula calculates the shortest distance between two points on the surface of the Earth
  • In other words, the haversine formula calculates the great-circle distance between two points, giving the as-the-crow-flies distance along the Earth's surface
  • Note that the haversine distance ignores any hills and assumes smooth terrain
  • The haversine distance is defined as the following:
HS = 2r \arcsin\left(\sqrt{\sin^{2}\left(\frac{\lambda_{2}-\lambda_{1}}{2}\right) + \cos(\lambda_{1})\cos(\lambda_{2})\sin^{2}\left(\frac{\omega_{2}-\omega_{1}}{2}\right)}\right)
  • Where r is the radius of the Earth
  • Where \lambda_{i} is the latitude (in radians) of point i
  • Where \omega_{i} is the longitude (in radians) of point i
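  • A minimal Python sketch (same assumptions as the great-circle sketch above); this form is better behaved numerically than the arccos form when the two points are close together:

```python
import math

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Haversine distance between two (latitude, longitude) points in radians."""
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))
```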

Manhattan Distance

  • The Manhattan distance is a measure of dissimilarity between continuous variables
  • The Manhattan distance calculates the distance between two points measured along axes at right angles
  • The Manhattan distance is the distance induced by the L1 norm
  • If the data is high-dimensional, the Manhattan distance is usually preferred over the Euclidean distance
  • The Manhattan distance is defined as the following:
MH = \sum_{i=1}^{n}|x_{i}-y_{i}|
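  • A minimal Python sketch:

```python
def manhattan(x, y):
    """L1 (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(x, y))
```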

Cosine Distance

  • The cosine distance is a measure of dissimilarity between binary, nominal, or continuous variables
  • Specifically, the cosine distance is a measure of dissimilarity between two (and only two) vectors
  • The cosine formula calculates the cosine of the angle between the two vectors
  • Therefore, the cosine similarity is a measurement of orientation and not magnitude
  • Note that two vectors that are far apart in Euclidean distance can still have a small angle between them
  • Also, note that on L2-normalized data, the cosine similarity and the Euclidean distance produce the same nearest-neighbor rankings
  • The cosine distance is defined as the following:
\theta = \arccos\left(\frac{\sum_{i=1}^{n}a_{i}b_{i}}{\Vert a \Vert \Vert b \Vert}\right)
  • Where a and b are the two vectors being compared
  • Where \Vert a \Vert = \sqrt{\sum_{i=1}^{n} a_{i}^{2}}
  • Where \Vert b \Vert = \sqrt{\sum_{i=1}^{n} b_{i}^{2}}
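  • A minimal Python sketch of the angle calculation:

```python
import math

def cosine_angle(a, b):
    """Angle (in radians) between two vectors, via the cosine formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return math.acos(dot / (norm_a * norm_b))
```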

Hellinger Distance

  • The Hellinger distance is used to quantify the similarity between two probability distributions
  • The Hellinger distance is a probabilistic analog of the Euclidean distance
  • Given two probability distributions P and Q, the Hellinger distance is defined as:
h(P,Q) = \frac{1}{\sqrt{2}} \Vert \sqrt{P} - \sqrt{Q} \Vert_{2}
  • For example, let's say we're estimating a distribution for users and a distribution for non-users of a service
  • If the Hellinger distance between those groups is small for some features, then those features are not statistically useful for segmentation
  • The Wasserstein metric is becoming a preferred alternative for measuring the similarity between two probability distributions
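  • A minimal Python sketch for discrete distributions (assumed to be aligned arrays that each sum to 1):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

print(hellinger([0.5, 0.5], [0.9, 0.1]))  # 0.3249...
```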

Use-Cases for Euclidean Distance, Manhattan Distance, and Cosine Distance

  • The Euclidean distance corresponds to distance in the physical world, which is a natural notion of distance
  • The Euclidean distance is frequently used for finding the nearest hospital for emergency helicopter flights
  • The Euclidean distance is also used in natural language processing applications
  • Specifically, the Euclidean distance is calculated on a bag-of-words representation, after normalizing each document's word-count vector by its Euclidean length
  • The Manhattan distance is typically preferred to the Euclidean distance for high-dimensional data, since it remains more discriminative as the number of dimensions grows
  • The Manhattan formula is frequently used for measuring distances in chess, compressed sensing, and frequency distributions
  • The cosine similarity is based on the angle between two vectors
  • The cosine similarity is typically used in natural language processing applications
  • Specifically, the cosine similarity is used to measure how similar documents are to each other (irrespective of their size)

Example of Euclidean Distance

document  car  bike  tire  she  sand  bench  doctor
01        5    0     3     1    0     2      0
02        3    0     5     0    1     6      0
  • For this example, we will exclude the document variable from our Euclidean calculations
  • The Euclidean distance is calculated as the following:
EUC = \sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^{2}} = \sqrt{(5-3)^{2} + (0-0)^{2} + (3-5)^{2} + (1-0)^{2} + (0-1)^{2} + (2-6)^{2} + (0-0)^{2}} = \sqrt{4 + 0 + 4 + 1 + 1 + 16 + 0} = \sqrt{26} \approx 5.10

Example of Cosine Similarity

document  car  bike  tire  she  sand  bench  doctor
01        5    0     3     1    0     2      0
02        3    0     5     0    1     6      0
  • For this example, we will exclude the document variable from our cosine calculations
  • The cosine similarity is calculated as the following:
\theta = \arccos\left(\frac{\sum_{i=1}^{n}a_{i}b_{i}}{\Vert a \Vert \Vert b \Vert}\right) = \arccos\left(\frac{15+0+15+0+0+12+0}{\sqrt{25+0+9+1+0+4+0} \times \sqrt{9+0+25+0+1+36+0}}\right) = \arccos\left(\frac{42}{6.24 \times 8.43}\right) = \arccos(0.798) \approx 0.65
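  • Reusing the euclidean and cosine_angle sketches from above, both worked examples can be verified:

```python
d1 = [5, 0, 3, 1, 0, 2, 0]
d2 = [3, 0, 5, 0, 1, 6, 0]

print(euclidean(d1, d2))     # 5.099... (the square root of 26)
print(cosine_angle(d1, d2))  # 0.646... radians
```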
