Sometimes, we want to determine how similar two data entries are to each other
To do this, we need some way of measuring the similarity between those two entries
Therefore, we use similarity coefficients or distance (dissimilarity) measures
Simple matching coefficients (SMC) and Jaccard coefficients are calculated on binary data values
Euclidean distances and cosine distances are calculated on continuous data values
Simple Matching Coefficient
The simple matching coefficient (or SMC) is a measure of similarity between binary variables
The SMC calculates the proportion of mutual presences and mutual absences between the variable values
The simple matching coefficient is defined as the following:
$SMC = \frac{M_{00} + M_{11}}{M_{00} + M_{01} + M_{10} + M_{11}}$
Where M00 is the number of matching False values between two data entries
Where M11 is the number of matching True values between two data entries
Where M10 is the number of differing values where the first entry's value is True and the second entry's value is False
Where M01 is the number of differing values where the first entry's value is False and the second entry's value is True
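As a minimal sketch (assuming each entry is already encoded as an equal-length list of booleans, where True marks a presence), the SMC could be computed in Python as:

```python
def smc(x, y):
    """Simple matching coefficient between two equal-length boolean lists."""
    m11 = sum(a and b for a, b in zip(x, y))          # mutual presences
    m00 = sum(not a and not b for a, b in zip(x, y))  # mutual absences
    return (m00 + m11) / len(x)                       # matches over all attributes

print(smc([True, True, False, True], [True, False, False, True]))  # 3/4 = 0.75
```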
Jaccard Coefficient
The Jaccard similarity coefficient is a measure of similarity between binary variables
The Jaccard coefficient calculates the percentage of mutual presences between the variable values
The Jaccard coefficient can be thought of as a variant of the SMC that ignores mutual absences
The Jaccard coefficient is defined as the following:
$Jaccard = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$
Where M11 is the number of matching True values between two data entries
Where M10 is the number of differing values where the first entry's value is True and the second entry's value is False
Where M01 is the number of differing values where the first entry's value is False and the second entry's value is True
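Using the same boolean encoding as the SMC sketch above, a minimal sketch of the Jaccard coefficient:

```python
def jaccard(x, y):
    """Jaccard coefficient between two equal-length boolean lists."""
    m11 = sum(a and b for a, b in zip(x, y))        # mutual presences
    mismatches = sum(a != b for a, b in zip(x, y))  # M01 + M10
    return m11 / (m11 + mismatches)

print(jaccard([True, True, False, True], [True, False, False, True]))  # 2/3 ≈ 0.67
```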
Dice Coefficient
The Dice coefficient is a measure of similarity between binary variables
The Dice coefficient calculates the proportion of mutual presences between the variable values (ignoring mutual absences)
The Dice coefficient is used in NLP, specifically when using a bag-of-words approach
The Dice coefficient can be written in terms of the Jaccard coefficient
Therefore, the Dice coefficient is very similar to the Jaccard coefficient, but doesn't account for true negatives
The Dice coefficient is defined as the following:
$Dice = \frac{2J}{1 + J}$
Where J is the Jaccard coefficient
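A minimal sketch of the Dice coefficient, using the same boolean encoding as the sketches above:

```python
def dice(x, y):
    """Dice coefficient between two equal-length boolean lists."""
    m11 = sum(a and b for a, b in zip(x, y))        # mutual presences
    mismatches = sum(a != b for a, b in zip(x, y))  # M01 + M10
    return 2 * m11 / (2 * m11 + mismatches)         # equivalent to 2J / (1 + J)

print(dice([True, True, False, True], [True, False, False, True]))  # 0.8
```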
Use-Cases for SMC and Jaccard Coefficients
The SMC coefficient is used for symmetrical binary variables, meaning the two classes associated with the binary variables are thought to have equal importance
On the other hand, the Jaccard coefficient is typically used for asymmetrical binary variables, meaning the two classes associated with the binary variables are thought to have different levels of importance
However, the Jaccard coefficient can also be used for symmetrical binary variables
It's important to note that we can transform multi-class categorical variables into many separate binary variables so that we can calculate SMC or Jaccard coefficients on the data entry
Since the SMC is used for symmetrical binary variables, the SMC is typically used to detect cheating by comparing answers across two different exams
Since the Jaccard coefficient is used for asymmetrical binary variables, the Jaccard coefficient is typically used to detect fraud by comparing two different documents
Therefore, the Jaccard coefficient is typically used when we have many False values (or TN) that we don't want to account for in our similarity calculation
Example of the SMC and Jaccard Coefficient
id | Sex    | 25+ | Cough | Fever | Chills | Headache | Sore Throat
---|--------|-----|-------|-------|--------|----------|------------
01 | Female | Yes | Yes   | No    | Yes    | No       | Yes
02 | Male   | Yes | Yes   | No    | Yes    | Yes      | No
For this example, we will exclude the id variable from our SMC calculation
For this example, we will exclude the id variable from our Jaccard coefficient calculation
The SMC is calculated as the following:
$SMC = \frac{M_{00} + M_{11}}{M_{00} + M_{01} + M_{10} + M_{11}} = \frac{1 + 3}{1 + 1 + 2 + 3} = \frac{4}{7} \approx 0.57$
The Jaccard coefficient is calculated as the following:
$Jaccard = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} = \frac{3}{1 + 2 + 3} = \frac{3}{6} = 0.5$
Where M00=1 (from Fever)
Where M01=1 (from Headache)
Where M10=2 (from Sex and Sore Throat)
Where M11=3 (from 25+, Cough, and Chills)
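As a sanity check, here is a sketch that reproduces these numbers with the smc() and jaccard() helpers sketched earlier (assuming Female and Yes are encoded as True, which matches how the M values were counted above):

```python
# Entries 01 and 02 with id excluded: Sex, 25+, Cough, Fever, Chills, Headache, Sore Throat
# Requires the smc() and jaccard() helpers defined in the earlier sketches
entry_01 = [True,  True, True, False, True, False, True ]   # Female, Yes, Yes, No, Yes, No, Yes
entry_02 = [False, True, True, False, True, True,  False]   # Male, Yes, Yes, No, Yes, Yes, No

print(smc(entry_01, entry_02))      # 4/7 ≈ 0.57
print(jaccard(entry_01, entry_02))  # 3/6 = 0.50
```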
Example of the Dice Coefficient
id | Sex    | 25+ | Cough | Fever | Chills | Headache | Sore Throat
---|--------|-----|-------|-------|--------|----------|------------
01 | Female | Yes | Yes   | No    | Yes    | No       | Yes
02 | Male   | Yes | Yes   | No    | Yes    | Yes      | No
For this example, we will exclude the id variable from our Dice coefficient calculation
The Dice coefficient is calculated as the following:
$Dice = \frac{2M_{11}}{2M_{11} + M_{01} + M_{10}} = \frac{2 \times 3}{(2 \times 3) + 1 + 2} = \frac{6}{9} \approx 0.67$
Where M01=1 (from Headache)
Where M10=2 (from Sex and Sore Throat)
Where M11=3 (from 25+, Cough, and Chills)
Euclidean Distance
The Euclidean distance is a measure of dissimilarity between continuous variables
The Euclidean distance calculates the ordinary straight-line distance between two points in Euclidean space
The Euclidean distance can also be referred to as the Pythagorean distance
The Euclidean norm is also referred to as the L2 norm
The Euclidean distance is defined as the following:
$EUC = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
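A minimal NumPy sketch of the Euclidean distance:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean (L2) distance between two numeric vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean([0, 0], [3, 4]))  # 5.0
```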
Great-Circle Distance
The great-circle distance (or orthodromic distance) is a measure of dissimilarity between continuous variables
The great-circle formula calculates the shortest distance between two points on the surface of a sphere
The Earth is nearly spherical, so the great-circle distance provides an accurate distance between two points on the surface of the Earth within about 0.5%
The great-circle distance is defined as the following:
$GC = r\delta\sigma$
Where r is the radius of the sphere
Where $\delta\sigma = \arccos(\sin(\lambda_1)\sin(\lambda_2) + \cos(\lambda_1)\cos(\lambda_2)\cos(\omega_2 - \omega_1))$
Where λi is the latitude of point i
Where ωi is the longitude of point i
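A minimal sketch of the great-circle distance (assuming coordinates given in degrees and a mean Earth radius of 6371 km):

```python
from math import radians, sin, cos, acos

def great_circle(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance between two (latitude, longitude) points in degrees.
    r defaults to the mean Earth radius in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    cos_sigma = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon2 - lon1)
    return r * acos(max(-1.0, min(1.0, cos_sigma)))  # clamp guards against rounding error

# Roughly the Paris-to-London distance
print(great_circle(48.8566, 2.3522, 51.5074, -0.1278))  # ≈ 344 km
```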
Haversine Distance
The haversine distance is a measure of dissimilarity between continuous variables
The haversine formula calculates the shortest distance between two points on the surface of the Earth
In other words, the haversine formula calculates the great-circle distance between two points, giving an as-the-crow-flies distance along the Earth's surface
The haversine distance therefore ignores any hills and assumes smooth land
The haversine distance is defined as the following:
$HS = 2r\arcsin\left(\sqrt{\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right) + \cos(\lambda_1)\cos(\lambda_2)\sin^2\left(\frac{\omega_2 - \omega_1}{2}\right)}\right)$
Where r is the radius of the Earth
Where λi is the latitude (in radians) of point i
Where ωi is the longitude (in radians) of point i
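A matching sketch of the haversine distance (same assumptions as the great-circle sketch above):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Same Paris-to-London example as above; the two formulas agree
print(haversine(48.8566, 2.3522, 51.5074, -0.1278))  # ≈ 344 km
```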
Manhattan Distance
The Manhattan distance is a measure of dissimilarity between continuous variables
The Manhattan distance calculates the distance between two points measured along axes at right angles
The Manhattan norm is also referred to as the L1 norm
If the data is high dimensional, the Manhattan distance is usually preferred over the Euclidean distance
The Manhattan distance is defined as the following:
$MH = \sum_{i=1}^{n} |x_i - y_i|$
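A minimal sketch of the Manhattan distance:

```python
def manhattan(x, y):
    """Manhattan (L1) distance between two numeric vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan([0, 0], [3, 4]))  # 7
```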
Cosine Distance
The cosine distance is a measure of dissimilarity between binary, nominal, or continuous variables
Specifically, the cosine distance is a measure of dissimilarity between two (and only two) vectors
The cosine formula calculates the cosine of the angle between the two vectors
Therefore, the cosine similarity is a measurement of orientation and not magnitude
Note that two vectors that are far apart from each other can still have a small angle between them
Also, note that the cosine similarity is essentially equivalent to the Euclidean distance on normalized data, since the two are monotonically related in that case
The cosine distance is defined as the following:
$\theta = \arccos\left(\frac{\sum_{i=1}^{n} a_i b_i}{\lVert a \rVert \lVert b \rVert}\right)$
Where a represents vector a
Where b represents vector b
Where $\lVert a \rVert = \sqrt{\sum_{i=1}^{n} a_i^2}$
Where $\lVert b \rVert = \sqrt{\sum_{i=1}^{n} b_i^2}$
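A minimal NumPy sketch of the cosine distance as defined above (the arccos of the cosine similarity):

```python
import numpy as np

def cosine_distance(a, b):
    """Angle (in radians) between two vectors, following the formula above."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))  # clip guards against rounding error

print(cosine_distance([1, 0], [1, 1]))  # ≈ 0.785 radians (45 degrees)
```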
Hellinger Distance
The Hellinger distance is used to quantify the similarity between two probability distributions
The Hellinger distance is a probabilistic analog of the Euclidean distance, applied to the square roots of the distributions
Given two probability distributions P and Q, the Hellinger distance is defined as:
$h(P, Q) = \frac{1}{\sqrt{2}} \lVert \sqrt{P} - \sqrt{Q} \rVert_2$
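A minimal NumPy sketch for the discrete case (assuming P and Q are given as arrays of probabilities over the same support):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

print(hellinger([0.5, 0.5], [0.9, 0.1]))  # ≈ 0.32
```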
For example, let's say we're estimating a distribution for users and a distribution for non-users of a service
If the Hellinger distance between those groups is small for some features, then those features are not statistically useful for segmentation
The Wasserstein metric is becoming a more preferred metric for measuring the similarity between two probability distributions
Use-Cases for Euclidean Distance, Manhattan Distance, and Cosine Distance
The Euclidean distance represents a distance in the physical world, which is a natural notion of distance
The Euclidean distance is frequently used for finding the nearest hospital for emergency helicopter flights
The Euclidean distance is also used in natural language processing applications
Specifically, the Euclidean distance is calculated on a bag-of-words representation, while normalizing the word-count vectors by the Euclidean length of each document
The Manhattan distance is typically preferred to the Euclidean distance for high-dimensional data, since distances computed under the L1 norm tend to remain more meaningful there
The Manhattan formula is frequently used for measuring distances in chess, compressed sensing, and frequency distributions
The cosine similarity is represented as an angle between two vectors
The cosine similarity is typically used in natural language processing applications
Specifically, the cosine similarity is used to measure how similar documents are to each other (irrespective of their size)
Example of Euclidean Distance
document | car | bike | tire | she | sand | bench | doctor
---------|-----|------|------|-----|------|-------|-------
01       | 5   | 0    | 3    | 1   | 0    | 2     | 0
02       | 3   | 0    | 5    | 0   | 1    | 6     | 0
For this example, we will exclude the document variable from our Euclidean distance calculation
The Euclidean distance is calculated as the following: