This is a package for computing distances among observations of statistical variables, such as: Euclidean, Minkowski, Canberra, Pearson, Mahalanobis, Robust Mahalanobis, Gower, Generalized Gower and Related Metric Scaling (RelMS). A total of 41 statistical distances can be computed.

PyDistances: A Statistical Distances Python Package

Installation

pip install PyDistances

Example of use

import itertools
import numpy as np
import pandas as pd
import PyDistances
from PyDistances import Euclidean_Dist, Euclidean_Dist_Matrix, Minkowski_Dist, Minkowski_Dist_Matrix, Canberra_Dist, Canberra_Dist_Matrix, Pearson_Dist, Pearson_Dist_Matrix, Mahalanobis_Dist, Mahalanobis_Dist_Matrix, a_b_c_d_Matrix, Sokal_Similarity, Sokal_Dist, Sokal_Dist_Matrix, Jaccard_Similarity, Jaccard_Dist, Jaccard_Dist_Matrix, alpha, Matching_Similarity, Matching_Dist, Matching_Dist_Matrix, Gower_Similarity_Matrix, Gower_Dist_Matrix, Robust_Mahalanobis_Dist, Robust_Mahalanobis_Dist_Matrix, GeneralizedGowerDistance

Getting data

We load the data used throughout this tutorial. The data-set is available at the following link: https://github.com/FabioScielzoOrtiz/Distances_Package/blob/master/Tests/House_Price.csv

Data = pd.read_csv('House_Price.csv')
Data = Data.loc[0:150, ['latitude', 'longitude', 'price', 'size_in_m_2', 'balcony_recode', 'private_garden_recode', 'private_gym_recode', 'quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]
Data_quant = Data.loc[:,['latitude', 'longitude', 'price', 'size_in_m_2']]
Data_binary = Data.loc[:,['balcony_recode', 'private_garden_recode', 'private_gym_recode']]
Data_multiclass = Data.loc[:,['quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]
Data.head()  # p1=4 quantitative, p2=3 binary, p3=3 multiclass variables
latitude  longitude       price  size_in_m_2  balcony  private_garden  private_gym  quality  no_of_bathrooms  no_of_bedrooms
 25.1132    55.1389     2.7e+06      100.242        1               0            0        2                2               1
 25.1068    55.1512    2.85e+06      146.973        1               0            0        2                2               2
 25.0633    55.1377    1.15e+06      181.254        1               0            0        2                5               3
 25.2273    55.3418    2.85e+06      187.664        1               0            0        1                3               2
 25.1143    55.1398  1.7292e+06      47.1018        0               0            0        2                1               0

Computing Euclidean distance

We compute the Euclidean distance between the observation at index 0 and itself.

Euclidean_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:])
 0.0

We compute the Euclidean distance between the observations at index 0 and index 2.

Euclidean_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:])
 1550000.002117049
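
The result is almost exactly the price difference (2,700,000 - 1,150,000 = 1,550,000), because price is several orders of magnitude larger than the other three variables. As a rough check, the following plain-Python sketch (not the package's implementation) applies the textbook Euclidean formula to the rounded values printed by Data.head(); the names x0, x2 and euclidean are ours.

```python
import math

# Rows 0 and 2 of Data_quant, copied from the rounded Data.head() output:
# (latitude, longitude, price, size_in_m_2)
x0 = (25.1132, 55.1389, 2.7e6, 100.242)
x2 = (25.0633, 55.1377, 1.15e6, 181.254)

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean(x0, x2)
print(d)  # close to 1550000.0021, up to rounding of the inputs
```

Because the unscaled Euclidean distance is dominated by the variable with the largest scale, the standardized distances shown later (Pearson, Mahalanobis) are often more informative for data like this.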

We compute the Euclidean distance matrix for the data-set Data_quant.

Euclidean_Dist_Matrix(Data_quant)
array([[       0.        ,   150000.00727904,  1550000.00211705, ...,
         1500000.00009635,  2700000.01899102, 12100000.00553371],
       [  150000.00727904,        0.        ,  1700000.00034565, ...,
         1650000.00026782,  2550000.0146678 , 11950000.00426352],
       [ 1550000.00211705,  1700000.00034565,        0.        , ...,
           50000.040973  ,  4250000.00673279, 13650000.00297389],
       ...,
       [ 1500000.00009635,  1650000.00026782,    50000.040973  , ...,
               0.        ,  4200000.01094663, 13600000.00447653],
       [ 2700000.01899102,  2550000.0146678 ,  4250000.00673279, ...,
         4200000.01094663,        0.        ,  9400000.00011113],
       [12100000.00553371, 11950000.00426352, 13650000.00297389, ...,
        13600000.00447653,  9400000.00011113,        0.        ]])

We now repeat the same procedure with the other distances available in PyDistances.


Computing Minkowski distance

Minkowski_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], q=1)
 0.0
Minkowski_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], q=1)
 1550081.062526
Minkowski_Dist_Matrix(Data_quant, q=1)
array([[       0.      ,   150046.748877,  1550081.062526, ...,
         1500017.050769,  2700320.266531, 12100365.997115],
       [  150046.748877,        0.      ,  1700034.338187, ...,
         1650029.78435 ,  2550273.554024, 11950319.272776],
       [ 1550081.062526,  1700034.338187,        0.      , ...,
           50064.027555,  4250239.302851, 13650284.955165],
       ...,
       [ 1500017.050769,  1650029.78435 ,    50064.027555, ...,
               0.      ,  4200303.29563 , 13600348.947944],
       [ 2700320.266531,  2550273.554024,  4250239.302851, ...,
         4200303.29563 ,        0.      ,  9400045.764238],
       [12100365.997115, 11950319.272776, 13650284.955165, ...,
        13600348.947944,  9400045.764238,        0.      ]])
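
For reference, the Minkowski distance of order q is the q-th root of the sum of |x_j - y_j|^q; q=1 gives the Manhattan distance and q=2 the Euclidean distance. The sketch below is a plain-Python illustration of that formula applied to the rounded Data.head() values, not the package's own code; the function name minkowski is ours.

```python
def minkowski(x, y, q=1):
    """Minkowski distance of order q >= 1 between two equal-length sequences."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Rows 0 and 2 of Data_quant (rounded values from Data.head())
x0 = (25.1132, 55.1389, 2.7e6, 100.242)
x2 = (25.0633, 55.1377, 1.15e6, 181.254)

d_q1 = minkowski(x0, x2, q=1)  # Manhattan: about 1550081.06
d_q2 = minkowski(x0, x2, q=2)  # Euclidean: about 1550000.002
print(d_q1, d_q2)
```

With q=1 the size difference (about 81 square metres) is added linearly to the price difference, which is why this value exceeds the Euclidean one by roughly 81.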

Computing Canberra distance

Canberra_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:])
  0.0
Canberra_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:])
 0.6913917083019879
Canberra_Dist_Matrix(Data_quant)
array([[0.        , 0.21629237, 0.69139171, ..., 0.463675  , 0.9485963 ,
        1.33838751],
       [0.21629237, 0.        , 0.53043317, ..., 0.52079671, 0.79157752,
        1.19854721],
       [0.69139171, 0.53043317, 0.        , ..., 0.23597883, 1.04765637,
        1.29619958],
       ...,
       [0.463675  , 0.52079671, 0.23597883, ..., 0.        , 1.20126891,
        1.44813664],
       [0.9485963 , 0.79157752, 1.04765637, ..., 1.20126891, 0.        ,
        0.51782969],
       [1.33838751, 1.19854721, 1.29619958, ..., 1.44813664, 0.51782969,
        0.        ]])
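
The Canberra distance sums |x_j - y_j| / (|x_j| + |y_j|) over the variables, so each variable contributes at most 1 and price no longer dominates. The following plain-Python sketch of that standard definition reproduces the value above from the rounded Data.head() values. Skipping terms whose denominator is zero is a convention chosen for this sketch; we have not checked how the package handles that case.

```python
def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

# Rows 0 and 2 of Data_quant (rounded values from Data.head())
x0 = (25.1132, 55.1389, 2.7e6, 100.242)
x2 = (25.0633, 55.1377, 1.15e6, 181.254)

d = canberra(x0, x2)
print(d)  # about 0.6914
```

Here price contributes about 0.40 and size about 0.29, while latitude and longitude contribute almost nothing.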

Computing Pearson distance

Pearson_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], variance=Data.var())
 0.0
Pearson_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], variance=Data.var())
 1.5393297661160206
Pearson_Dist_Matrix(Data_quant)
array([[0.        , 0.63961801, 1.53932977, ..., 1.03084131, 4.32943281,
        7.47171915],
       [0.63961801, 0.        , 1.20505141, ..., 1.09780711, 3.76643257,
        7.04893716],
       [1.53932977, 1.20505141, 0.        , ..., 0.84617436, 3.79891055,
        7.4670243 ],
       ...,
       [1.03084131, 1.09780711, 0.84617436, ..., 0.        , 4.44143053,
        7.87905955],
       [4.32943281, 3.76643257, 3.79891055, ..., 4.44143053, 0.        ,
        4.57460318],
       [7.47171915, 7.04893716, 7.4670243 , ..., 7.87905955, 4.57460318,
        0.        ]])
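
The variance argument suggests that the Pearson distance here is the Euclidean distance after dividing each squared difference by the variance of its variable. Under that assumption, a minimal sketch on hypothetical toy data (the names pearson_dist and toy are ours, and the package may differ in details) looks like this:

```python
import math
import statistics

def pearson_dist(x, y, variances):
    """Euclidean distance with each squared difference divided by a variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# Hypothetical toy data: 3 observations, 2 variables on very different scales
toy = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]

# Sample variances per column (n - 1 denominator, like pandas' default var())
variances = [statistics.variance(col) for col in zip(*toy)]

d = pearson_dist(toy[0], toy[2], variances)
print(variances, d)  # variances are 1 and 100; d is sqrt(4/1 + 400/100) = sqrt(8)
```

After standardization both variables contribute equally, even though the second one is ten times larger in raw units.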

Computing Mahalanobis distance

Mahalanobis_Dist(Data_quant.iloc[0,:], Data_quant.iloc[0,:], S_inv=np.linalg.inv( np.cov(Data_quant , rowvar=False) ))
 0.0
Mahalanobis_Dist(Data_quant.iloc[0,:], Data_quant.iloc[2,:], S_inv=np.linalg.inv( np.cov(Data_quant , rowvar=False) ))
 2.7671855371187757
Mahalanobis_Dist_Matrix(Data_quant)
array([[0.        , 0.92801614, 2.76718554, ..., 1.52541554, 5.21105193,
        6.45997793],
       [0.92801614, 0.        , 1.96135599, ..., 0.98693199, 4.43479282,
        6.2920865 ],
       [2.76718554, 1.96135599, 0.        , ..., 1.3592188 , 3.4307313 ,
        7.27986558],
       ...,
       [1.52541554, 0.98693199, 1.3592188 , ..., 0.        , 4.41360406,
        7.01503103],
       [5.21105193, 4.43479282, 3.4307313 , ..., 4.41360406, 0.        ,
        7.4691448 ],
       [6.45997793, 6.2920865 , 7.27986558, ..., 7.01503103, 7.4691448 ,
        0.        ]])
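
The Mahalanobis distance is sqrt((x - y)^T S^{-1} (x - y)), where S is the covariance matrix passed above through its inverse S_inv. To show what the inverse covariance does, the sketch below writes the two-variable case out by hand in plain Python on hypothetical toy data; it is an illustration of the formula, not the package's code, and the names cov2 and mahalanobis2 are ours.

```python
import math

def cov2(data):
    """Sample covariance entries (s_xx, s_xy, s_yy) for two-column data."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    s_xx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    s_yy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    s_xy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    return s_xx, s_xy, s_yy

def mahalanobis2(x, y, data):
    """Two-variable Mahalanobis distance with the 2x2 inverse written explicitly."""
    s_xx, s_xy, s_yy = cov2(data)
    det = s_xx * s_yy - s_xy ** 2
    dx, dy = x[0] - y[0], x[1] - y[1]
    return math.sqrt((s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det)

# Hypothetical toy data with positively correlated variables
toy = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (1.0, 0.0), (1.0, 2.0)]

d_along = mahalanobis2((0.0, 0.0), (2.0, 2.0), toy)   # along the correlation
d_across = mahalanobis2((0.0, 2.0), (2.0, 0.0), toy)  # against the correlation
print(d_along, d_across)  # sqrt(8) and sqrt(40)
```

Both pairs are the same Euclidean distance apart, but the pair that runs against the correlation is considered much farther apart.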

Computing Sokal similarity

a,b,c,d,p = a_b_c_d_Matrix(Data_binary)
Sokal_Similarity(i=0, r=2, a=a, d=d, p=p)
 1.0
Sokal_Dist(i=0, r=2, a=a, d=d, p=p)
 0.0
Sokal_Dist_Matrix(Data_binary)
array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.81649658],
       [0.81649658, 0.81649658, 0.81649658, ..., 0.81649658, 0.81649658,
        0.        ]])

Computing Jaccard similarity

Jaccard_Similarity(i=0, r=2, a=a, d=d, p=p)
  1.0
Jaccard_Dist(i=0, r=2, a=a, d=d, p=p)
 0.0
Jaccard_Dist_Matrix(Data_binary)
array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])
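
Both binary similarities in the two sections above are built from four counts for a pair of observations: a (both 1), b and c (only one of them is 1) and d (both 0), with p = a + b + c + d. The Sokal-Michener similarity is (a + d) / p and the Jaccard similarity is a / (a + b + c). The values 0.81649658 and 1.0 in the matrices above are consistent with the transformation d = sqrt(2 (1 - s)); this is inferred from the printed outputs, not taken from the package source. The sketch below is plain Python with names of our own, and the second vector is hypothetical.

```python
import math

def abcd(x, y):
    """Counts for two 0/1 vectors: both 1, only x is 1, only y is 1, both 0."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def sokal_similarity(x, y):
    a, b, c, d = abcd(x, y)
    return (a + d) / (a + b + c + d)

def jaccard_similarity(x, y):
    a, b, c, d = abcd(x, y)
    if a + b + c == 0:
        return 1.0  # convention chosen for this sketch when both vectors are all zeros
    return a / (a + b + c)

def similarity_to_distance(s):
    """Assumed transformation, inferred from the outputs above."""
    return math.sqrt(2.0 * (1.0 - s))

x = (1, 0, 0)  # balcony, private_garden, private_gym for row 0
y = (1, 1, 0)  # hypothetical observation that differs in one variable

d_sokal = similarity_to_distance(sokal_similarity(x, y))      # s = 2/3
d_jaccard = similarity_to_distance(jaccard_similarity(x, y))  # s = 1/2
print(d_sokal, d_jaccard)  # about 0.8165 and 1.0
```

Jaccard ignores joint absences (d), so the same pair is rated less similar than under Sokal-Michener.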

Computing Matching similarity

Matching_Similarity(x_i=Data_multiclass.iloc[0,:], x_r=Data_multiclass.iloc[2,:], Data=Data_multiclass)
0.3333333333333333
Matching_Dist(x_i=Data_multiclass.iloc[0,:], x_r=Data_multiclass.iloc[2,:], Data=Data_multiclass)
   1.1547005383792517
Matching_Dist_Matrix(Data_multiclass)
array([[0.        , 0.81649658, 1.15470054, ..., 0.81649658, 1.15470054,
        1.41421356],
       [0.81649658, 0.        , 1.15470054, ..., 0.        , 1.15470054,
        1.41421356],
       [1.15470054, 1.15470054, 0.        , ..., 1.15470054, 0.81649658,
        1.15470054],
       ...,
       [0.81649658, 0.        , 1.15470054, ..., 0.        , 1.15470054,
        1.41421356],
       [1.15470054, 1.15470054, 0.81649658, ..., 1.15470054, 0.        ,
        1.15470054],
       [1.41421356, 1.41421356, 1.15470054, ..., 1.41421356, 1.15470054,
        0.        ]])
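
The matching similarity of 0.3333 above is what we get by counting the variables on which two observations share the same category: rows 0 and 2 agree only on quality, that is, on 1 of 3 variables. The distance 1.1547 is again consistent with d = sqrt(2 (1 - s)), a transformation inferred from the printed outputs rather than from the package source. A plain-Python sketch with names of our own:

```python
import math

def matching_similarity(x, y):
    """Fraction of variables on which both observations take the same category."""
    return sum(1 for u, v in zip(x, y) if u == v) / len(x)

x0 = (2, 2, 1)  # quality, no_of_bathrooms, no_of_bedrooms for row 0
x2 = (2, 5, 3)  # the same variables for row 2

s = matching_similarity(x0, x2)  # 1 match out of 3
d = math.sqrt(2.0 * (1.0 - s))   # assumed similarity-to-distance transformation
print(s, d)  # about 0.3333 and 1.1547
```

With three variables only four similarity values are possible (0, 1/3, 2/3, 1), which is why the matrix above contains only 0, 0.8165, 1.1547 and 1.4142.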

Computing Gower distance

From a theoretical perspective, Gower (1971) has been followed.

Gower_Similarity_Matrix(Data, p1=4, p2=3, p3=3)
array([[1.        , 0.85175283, 0.68485131, ..., 0.83008431, 0.62482353,
        0.34709882],
       [0.85175283, 1.        , 0.69489168, ..., 0.94863663, 0.63064768,
        0.35833279],
       [0.68485131, 0.69489168, 1.        , ..., 0.72293677, 0.73120218,
        0.48172501],
       ...,
       [0.83008431, 0.94863663, 0.72293677, ..., 1.        , 0.59776459,
        0.36311382],
       [0.62482353, 0.63064768, 0.73120218, ..., 0.59776459, 1.        ,
        0.55654437],
       [0.34709882, 0.35833279, 0.48172501, ..., 0.36311382, 0.55654437,
        1.        ]])
Gower_Dist_Matrix(Data, p1=4, p2=3, p3=3)
array([[0.        , 0.38502879, 0.56138105, ..., 0.41220831, 0.61251651,
        0.808023  ],
       [0.38502879, 0.        , 0.55236611, ..., 0.22663488, 0.60774363,
        0.80104133],
       [0.56138105, 0.55236611, 0.        , ..., 0.52636796, 0.51845716,
        0.71991318],
       ...,
       [0.41220831, 0.22663488, 0.52636796, ..., 0.        , 0.63422032,
        0.79805149],
       [0.61251651, 0.60774363, 0.51845716, ..., 0.63422032, 0.        ,
        0.66592464],
       [0.808023  , 0.80104133, 0.71991318, ..., 0.79805149, 0.66592464,
        0.        ]])
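
As commonly stated, Gower's general similarity coefficient for mixed data adds 1 - |x_h - y_h| / R_h over the p1 quantitative variables (R_h is the range of variable h), the number a of joint presences over the p2 binary variables, and the number of matches over the p3 multiclass variables, and divides by p1 + (p2 - d) + p3, where d is the number of joint absences. The sketch below assumes the package follows this form, which we have not verified in its source. The printed matrices above are consistent with the distance d = sqrt(1 - s). All names and the toy quantitative values are ours.

```python
import math

def gower_similarity(xq, yq, ranges, xb, yb, xm, ym):
    """Gower (1971) general similarity for quantitative, binary and multiclass parts."""
    quant = sum(1.0 - abs(a - b) / r for a, b, r in zip(xq, yq, ranges))
    a = sum(1 for u, v in zip(xb, yb) if u == 1 and v == 1)  # joint presences
    d = sum(1 for u, v in zip(xb, yb) if u == 0 and v == 0)  # joint absences
    matches = sum(1 for u, v in zip(xm, ym) if u == v)
    p1, p2, p3 = len(xq), len(xb), len(xm)
    return (quant + a + matches) / (p1 + (p2 - d) + p3)

# Hypothetical toy pair: 2 quantitative, 3 binary, 3 multiclass variables
s = gower_similarity(
    xq=(1.0, 10.0), yq=(3.0, 10.0), ranges=(4.0, 20.0),
    xb=(1, 0, 0), yb=(1, 1, 0),
    xm=(2, 2, 1), ym=(2, 5, 3),
)
print(s, math.sqrt(1.0 - s))  # 0.5 and about 0.7071

# Check the assumed transformation against entries printed above
d_01 = math.sqrt(1.0 - 0.85175283)  # similarity entry [0, 1]
print(d_01)  # about 0.38502879, the distance entry [0, 1]
```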

Computing Robust Mahalanobis distance

From a theoretical perspective, Gnanadesikan (1997) and Devlin et al. (1975) have been followed.

Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='MAD', epsilon=0.05, n_iters=20)
 2.1448247626892223
Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='trimmed', alpha=0.1, epsilon=0.05, n_iters=20)
 2.7434709885399884
Robust_Mahalanobis_Dist(x_i=Data_quant.iloc[0,:], x_r=Data_quant.iloc[2,:], Data=Data_quant, Method='winsorized', alpha=0.1, epsilon=0.05, n_iters=20)
 2.8446274140577943
Robust_Mahalanobis_Dist_Matrix(Data=Data_quant, Method='trimmed', alpha=0.1, epsilon=0.05, n_iters=20)
array([[ 0.        ,  0.89250845,  2.74347099, ...,  1.48503889,
         5.95276234,  8.49453068],
       [ 0.89250845,  0.        ,  1.99959936, ...,  0.96839524,
         5.33355737,  8.32070442],
       [ 2.74347099,  1.99959936,  0.        , ...,  1.36336733,
         4.12306341,  9.38094479],
       ...,
       [ 1.48503889,  0.96839524,  1.36336733, ...,  0.        ,
         5.1322854 ,  9.00337923],
       [ 5.95276234,  5.33355737,  4.12306341, ...,  5.1322854 ,
         0.        , 11.06785954],
       [ 8.49453068,  8.32070442,  9.38094479, ...,  9.00337923,
        11.06785954,  0.        ]])
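
The Method argument selects a robust estimator of spread. As a rough illustration of the three ideas only, the sketch below computes the median absolute deviation, a trimmed variance and a winsorized variance in plain Python. It assumes alpha is the fraction handled in each tail, which may not match the package's convention, and it does not reproduce the covariance estimation controlled by epsilon and n_iters. All names and data are hypothetical.

```python
import statistics

def mad(values):
    """Median of absolute deviations from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def trimmed_variance(values, alpha):
    """Sample variance after dropping the alpha fraction in each tail (assumed convention)."""
    s = sorted(values)
    k = int(alpha * len(s))
    return statistics.variance(s[k:len(s) - k])

def winsorized_variance(values, alpha):
    """Sample variance after clipping each tail to its nearest retained value."""
    s = sorted(values)
    k = int(alpha * len(s))
    low, high = s[k], s[len(s) - k - 1]
    return statistics.variance([min(max(v, low), high) for v in s])

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical sample with one outlier

print(statistics.variance(values))       # 1902.5, inflated by the outlier
print(mad(values))                       # 1.0
print(trimmed_variance(values, 0.2))     # 1.0, computed from [2, 3, 4]
print(winsorized_variance(values, 0.2))  # 1.0, computed from [2, 2, 3, 4, 4]
```

A single outlier inflates the classical variance by three orders of magnitude, while the three robust alternatives barely move.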

Computing Generalized Gower distance and Related Metric Scaling

To end this tutorial, we compute both the Generalized Gower distance matrix and the Related Metric Scaling matrix for the mixed data-set Data, considering all possible combinations of the quantitative, binary and multiclass distances. We then save all the resulting matrices in a Python dictionary.
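
To see how many combinations this involves, the following standalone sketch enumerates them with itertools.product using the distance names from this tutorial. The variable names plain, robust, names and robust_methods are ours.

```python
import itertools

d1_list = ['Euclidean', 'Minkowski', 'Canberra', 'Pearson', 'Mahalanobis']
d2_list = ['Sokal', 'Jaccard']
d3_list = ['Matching']
robust_methods = ['trimmed', 'winsorized', 'MAD']

# 5 x 2 x 1 combinations with a non-robust quantitative distance
plain = ['_'.join(c) for c in itertools.product(d1_list, d2_list, d3_list)]

# 1 x 2 x 1 x 3 combinations with the Robust Mahalanobis distance
robust = ['_'.join(c) for c in itertools.product(
    ['Robust_Mahalanobis'], d2_list, d3_list, robust_methods)]

# Each combination is computed twice: Generalized Gower (GG) and RelMS
names = ['GG_' + n for n in plain + robust] + ['RelMS_' + n for n in plain + robust]
print(len(plain), len(robust), len(names))  # 10 6 32
```

This gives 16 Generalized Gower matrices and 16 Related Metric Scaling matrices, 32 in total.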

From a theoretical perspective, we have followed Cuadras and Fortiana (1998), Albarrán et al. (2015) and Grané et al. (2021).
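
As we understand the cited literature, the Generalized Gower distance adds the squared quantitative, binary and multiclass distances after dividing each by its geometric variability V = (1 / (2 n^2)) * sum of squared pairwise distances, so that the three blocks are on a comparable scale. The sketch below illustrates that combination in plain Python on hypothetical 3 x 3 matrices; it is an assumption about the method, not the package's code, and it does not cover Related Metric Scaling, which additionally discounts information shared between the blocks.

```python
import math

def geometric_variability(D):
    """V = (1 / (2 n^2)) * sum over all i, j of the squared distances."""
    n = len(D)
    return sum(d ** 2 for row in D for d in row) / (2.0 * n * n)

def generalized_gower(D1, D2, D3):
    """Combine three distance matrices after scaling by geometric variability."""
    scaled = [(D, geometric_variability(D)) for D in (D1, D2, D3)]
    n = len(D1)
    return [[math.sqrt(sum(D[i][j] ** 2 / V for D, V in scaled))
             for j in range(n)] for i in range(n)]

# Hypothetical distance matrices for 3 observations
D_quant = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]   # V = 12/18
D_binary = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # V = 4/18
D_multi = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]   # V = 24/18

D = generalized_gower(D_quant, D_binary, D_multi)
print(D[0][1])  # sqrt(1.5 + 4.5 + 3.0) = 3.0
```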

# Generalized Gower (GG) and Related Metric Scaling (RelMS) matrices, kept in
# separate lists for the non-robust and the robust quantitative distances.
D_GG_list_maha_robust = []
D_RelMS_list_maha_robust = []
D_GG_list_not_maha_robust = []
D_RelMS_list_not_maha_robust = []

d1_list = ['Euclidean', 'Minkowski', 'Canberra', 'Pearson', 'Mahalanobis']
d2_list = ['Sokal', 'Jaccard']
d3_list = ['Matching']
robust_methods = ['trimmed', 'winsorized', 'MAD']

# All (quantitative, binary, multiclass) combinations, without and with Robust Mahalanobis
combos = list(itertools.product(d1_list, d2_list, d3_list))
robust_combos = list(itertools.product(['Robust_Mahalanobis'], d2_list, d3_list, robust_methods))

# Generalized Gower, non-robust quantitative distances
for d in combos:
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], q=1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
    D_GG_list_not_maha_robust.append(D)

# Generalized Gower, Robust Mahalanobis with each robust method
for d in robust_combos:
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], epsilon=0.05, Method=d[3], alpha=0.1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)
    D_GG_list_maha_robust.append(D)

# Related Metric Scaling, non-robust quantitative distances
for d in combos:
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], q=1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True, tol=0.009, d=2)
    D_RelMS_list_not_maha_robust.append(D)

# Related Metric Scaling, Robust Mahalanobis with each robust method
for d in robust_combos:
    Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1=d[0], d2=d[1], d3=d[2], epsilon=0.05, Method=d[3], alpha=0.1)
    D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True, tol=0.009, d=2)
    D_RelMS_list_maha_robust.append(D)

D_GG_list = D_GG_list_not_maha_robust + D_GG_list_maha_robust
D_RelMS_list = D_RelMS_list_not_maha_robust + D_RelMS_list_maha_robust
search_space = D_GG_list + D_RelMS_list

# Names follow the same order as search_space: GG first, then RelMS
combo_names = ['_'.join(x) for x in combos + robust_combos]
distance_names = ['GG_' + name for name in combo_names] + ['RelMS_' + name for name in combo_names]
dic_distance_matrix = dict(zip(distance_names, search_space))
dic_distance_matrix
{'GG_Euclidean_Sokal_Matching': array([[0.        , 1.01161446, 1.60800698, ..., 1.23798333, 1.92432848,
         6.35838514],
        [1.01161446, 0.        , 1.64229596, ..., 0.7889253 , 1.87696727,
         6.29319748],
        [1.60800698, 1.64229596, 0.        , ..., 1.42723912, 2.26882579,
         6.96673669],
        ...,
        [1.23798333, 0.7889253 , 1.42723912, ..., 0.        , 2.4635748 ,
         7.01727531],
        [1.92432848, 1.87696727, 2.26882579, ..., 2.4635748 , 0.        ,
         5.11270638],
        [6.35838514, 6.29319748, 6.96673669, ..., 7.01727531, 5.11270638,
         0.        ]]),
 'GG_Euclidean_Jaccard_Matching': array([[0.        , 1.01161446, 1.60800698, ..., 1.23798333, 1.92432848,
         6.21923207],
        [1.01161446, 0.        , 1.64229596, ..., 0.7889253 , 1.87696727,
         6.15257024],
        [1.60800698, 1.64229596, 0.        , ..., 1.42723912, 2.26882579,
         6.83997121],
        ...,
        [1.23798333, 0.7889253 , 1.42723912, ..., 0.        , 2.4635748 ,
         6.89143953],
        [1.92432848, 1.87696727, 2.26882579, ..., 2.4635748 , 0.        ,
         4.93857798],
        [6.21923207, 6.15257024, 6.83997121, ..., 6.89143953, 4.93857798,
         0.        ]]),
 'GG_Minkowski_Sokal_Matching': array([[0.        , 1.01161589, 1.60801451, ..., 1.23797549, 1.92440501,
         6.35838512],
        [1.01161589, 0.        , 1.64229192, ..., 0.78891568, 1.87702827,
         6.29317915],
        [1.60801451, 1.64229192, 0.        , ..., 1.42723962, 2.2688732 ,
         6.96667937],
        ...,
        [1.23797549, 0.78891568, 1.42723962, ..., 0.        , 2.46364348,
         7.01724763],
        [1.92440501, 1.87702827, 2.2688732 , ..., 2.46364348, 0.        ,
         5.11260609],
        [6.35838512, 6.29317915, 6.96667937, ..., 7.01724763, 5.11260609,
         0.        ]]),
 'GG_Minkowski_Jaccard_Matching': array([[0.        , 1.01161589, 1.60801451, ..., 1.23797549, 1.92440501,
         6.21923205],
        [1.01161589, 0.        , 1.64229192, ..., 0.78891568, 1.87702827,
         6.15255149],
        [1.60801451, 1.64229192, 0.        , ..., 1.42723962, 2.2688732 ,
         6.83991282],
        ...,
        [1.23797549, 0.78891568, 1.42723962, ..., 0.        , 2.46364348,
         6.89141134],
        [1.92440501, 1.87702827, 2.2688732 , ..., 2.46364348, 0.        ,
         4.93847416],
        [6.21923205, 6.15255149, 6.83991282, ..., 6.89141134, 4.93847416,
         0.        ]]),
 'GG_Canberra_Sokal_Matching': array([[0.        , 1.1089173 , 2.04873576, ..., 1.41070641, 2.47064802,
         3.88007815],
        [1.1089173 , 0.        , 1.81887649, ..., 1.10728448, 2.20656591,
         3.66760203],
        [2.04873576, 1.81887649, 0.        , ..., 1.51266848, 2.44536222,
         3.67890583],
        ...,
        [1.41070641, 1.10728448, 1.51266848, ..., 0.        , 2.92569072,
         4.05431191],
        [2.47064802, 2.20656591, 2.44536222, ..., 2.92569072, 0.        ,
         2.67423498],
        [3.88007815, 3.66760203, 3.67890583, ..., 4.05431191, 2.67423498,
         0.        ]]),
 'GG_Canberra_Jaccard_Matching': array([[0.        , 1.1089173 , 2.04873576, ..., 1.41070641, 2.47064802,
         3.64757349],
        [1.1089173 , 0.        , 1.81887649, ..., 1.10728448, 2.20656591,
         3.42068569],
        [2.04873576, 1.81887649, 0.        , ..., 1.51266848, 2.44536222,
         3.43280265],
        ...,
        [1.41070641, 1.10728448, 1.51266848, ..., 0.        , 2.92569072,
         3.83239234],
        [2.47064802, 2.20656591, 2.44536222, ..., 2.92569072, 0.        ,
         2.32407372],
        [3.64757349, 3.42068569, 3.43280265, ..., 3.83239234, 2.32407372,
         0.        ]]),
 'GG_Pearson_Sokal_Matching': array([[0.        , 1.0588577 , 1.62258227, ..., 1.13386485, 2.59878376,
         4.5833716 ],
        [1.0588577 , 0.        , 1.54980561, ..., 0.55073019, 2.36782324,
         4.41160916],
        [1.62258227, 1.54980561, 0.        , ..., 1.48883715, 2.15643298,
         4.46893998],
        ...,
        [1.13386485, 0.55073019, 1.48883715, ..., 0.        , 2.64592015,
         4.75194328],
        [2.59878376, 2.36782324, 2.15643298, ..., 2.64592015, 0.        ,
         3.34753806],
        [4.5833716 , 4.41160916, 4.46893998, ..., 4.75194328, 3.34753806,
         0.        ]]),
 'GG_Pearson_Jaccard_Matching': array([[0.        , 1.0588577 , 1.62258227, ..., 1.13386485, 2.59878376,
         4.38828909],
        [1.0588577 , 0.        , 1.54980561, ..., 0.55073019, 2.36782324,
         4.20857237],
        [1.62258227, 1.54980561, 0.        , ..., 1.48883715, 2.15643298,
         4.26863098],
        ...,
        [1.13386485, 0.55073019, 1.48883715, ..., 0.        , 2.64592015,
         4.56407174],
        [2.59878376, 2.36782324, 2.15643298, ..., 2.64592015, 0.        ,
         3.07502796],
        [4.38828909, 4.20857237, 4.26863098, ..., 4.56407174, 3.07502796,
         0.        ]]),
 'GG_Mahalanobis_Sokal_Matching': array([[0.        , 1.11128701, 1.9908619 , ..., 1.26642065, 2.97833241,
         4.17851469],
        [1.11128701, 0.        , 1.73337267, ..., 0.49510815, 2.64311668,
         4.11353573],
        [1.9908619 , 1.73337267, 0.        , ..., 1.5815777 , 1.99507289,
         4.39053781],
        ...,
        [1.26642065, 0.49510815, 1.5815777 , ..., 0.        , 2.63417571,
         4.3979867 ],
        [2.97833241, 2.64311668, 1.99507289, ..., 2.63417571, 0.        ,
         4.4698317 ],
        [4.17851469, 4.11353573, 4.39053781, ..., 4.3979867 , 4.4698317 ,
         0.        ]]),
 'GG_Mahalanobis_Jaccard_Matching': array([[0.        , 1.11128701, 1.9908619 , ..., 1.26642065, 2.97833241,
         3.96355535],
        [1.11128701, 0.        , 1.73337267, ..., 0.49510815, 2.64311668,
         3.89499193],
        [1.9908619 , 1.73337267, 0.        , ..., 1.5815777 , 1.99507289,
         4.18647921],
        ...,
        [1.26642065, 0.49510815, 1.5815777 , ..., 0.        , 2.63417571,
         4.19429052],
        [2.97833241, 2.64311668, 1.99507289, ..., 2.63417571, 0.        ,
         4.26956454],
        [3.96355535, 3.89499193, 4.18647921, ..., 4.19429052, 4.26956454,
         0.        ]]),
 'GG_Robust_Mahalanobis_Sokal_Matching_trimmed': array([[0.        , 1.0738818 , 1.81990287, ..., 1.17982158, 2.83584093,
         4.38026385],
        [1.0738818 , 0.        , 1.64744788, ..., 0.39866732, 2.61869851,
         4.3233478 ],
        [1.81990287, 1.64744788, 0.        , ..., 1.53344794, 1.97466567,
         4.56660697],
        ...,
        [1.17982158, 0.39866732, 1.53344794, ..., 0.        , 2.54962302,
         4.5492545 ],
        [2.83584093, 2.61869851, 1.97466567, ..., 2.54962302, 0.        ,
         5.16721825],
        [4.38026385, 4.3233478 , 4.56660697, ..., 4.5492545 , 5.16721825,
         0.        ]]),
 'GG_Robust_Mahalanobis_Sokal_Matching_winsorized': array([[0.        , 1.10035027, 1.96521318, ..., 1.24876507, 3.02193061,
         4.2158267 ],
        [1.10035027, 0.        , 1.72244788, ..., 0.45786845, 2.71169847,
         4.170886  ],
        [1.96521318, 1.72244788, 0.        , ..., 1.57396145, 2.01907767,
         4.45138733],
        ...,
        [1.24876507, 0.45786845, 1.57396145, ..., 0.        , 2.6589383 ,
         4.42575055],
        [3.02193061, 2.71169847, 2.01907767, ..., 2.6589383 , 0.        ,
         4.74960743],
        [4.2158267 , 4.170886  , 4.45138733, ..., 4.42575055, 4.74960743,
         0.        ]]),
 'GG_Robust_Mahalanobis_Sokal_Matching_MAD': array([[0.        , 1.09006233, 1.80375514, ..., 1.18201607, 2.67497233,
         4.55678538],
        [1.09006233, 0.        , 1.62058379, ..., 0.44488228, 2.40606721,
         4.40232615],
        [1.80375514, 1.62058379, 0.        , ..., 1.53278692, 1.93813141,
         4.46679441],
        ...,
        [1.18201607, 0.44488228, 1.53278692, ..., 0.        , 2.48916367,
         4.64371521],
        [2.67497233, 2.40606721, 1.93813141, ..., 2.48916367, 0.        ,
         4.16671594],
        [4.55678538, 4.40232615, 4.46679441, ..., 4.64371521, 4.16671594,
         0.        ]]),
 'GG_Robust_Mahalanobis_Jaccard_Matching_trimmed': array([[0.        , 1.0738818 , 1.81990287, ..., 1.17982158, 2.83584093,
         4.17570322],
        [1.0738818 , 0.        , 1.64744788, ..., 0.39866732, 2.61869851,
         4.11595944],
        [1.81990287, 1.64744788, 0.        , ..., 1.53344794, 1.97466567,
         4.37077626],
        ...,
        [1.17982158, 0.39866732, 1.53344794, ..., 0.        , 2.54962302,
         4.35264315],
        [2.83584093, 2.61869851, 1.97466567, ..., 2.54962302, 0.        ,
         4.99499053],
        [4.17570322, 4.11595944, 4.37077626, ..., 4.35264315, 4.99499053,
         0.        ]]),
 'GG_Robust_Mahalanobis_Jaccard_Matching_winsorized': array([[0.        , 1.10035027, 1.96521318, ..., 1.24876507, 3.02193061,
         4.00287155],
        [1.10035027, 0.        , 1.72244788, ..., 0.45786845, 2.71169847,
         3.95551209],
        [1.96521318, 1.72244788, 0.        , ..., 1.57396145, 2.01907767,
         4.25025118],
        ...,
        [1.24876507, 0.45786845, 1.57396145, ..., 0.        , 2.6589383 ,
         4.22339365],
        [3.02193061, 2.71169847, 2.01907767, ..., 2.6589383 , 0.        ,
         4.5616397 ],
        [4.00287155, 3.95551209, 4.25025118, ..., 4.22339365, 4.5616397 ,
         0.        ]]),
 'GG_Robust_Mahalanobis_Jaccard_Matching_MAD': array([[0.        , 1.09006233, 1.80375514, ..., 1.18201607, 2.67497233,
         4.36051361],
        [1.09006233, 0.        , 1.62058379, ..., 0.44488228, 2.40606721,
         4.19884049],
        [1.80375514, 1.62058379, 0.        , ..., 1.53278692, 1.93813141,
         4.26638468],
        ...,
        [1.18201607, 0.44488228, 1.53278692, ..., 0.        , 2.48916367,
         4.45127812],
        [2.67497233, 2.40606721, 1.93813141, ..., 2.48916367, 0.        ,
         3.95111474],
        [4.36051361, 4.19884049, 4.26638468, ..., 4.45127812, 3.95111474,
         0.        ]]),
 'RelMS_Euclidean_Sokal_Matching': array([[0.        , 1.01092438, 1.68587263, ..., 1.2435966 , 1.75479379,
         5.76354972],
        [1.01092436, 0.        , 1.72123768, ..., 0.78892531, 1.71977376,
         5.69924943],
        [1.68587264, 1.7212377 , 0.        , ..., 1.42997022, 2.20660915,
         6.5504967 ],
        ...,
        [1.24359658, 0.78892532, 1.42997021, ..., 0.        , 2.26671431,
         6.42377887],
        [1.7547938 , 1.71977375, 2.20660914, ..., 2.26671431, 0.        ,
         4.781135  ],
        [5.76354972, 5.69924943, 6.55049671, ..., 6.42377887, 4.78113499,
         0.        ]]),
 'RelMS_Euclidean_Jaccard_Matching': array([[0.        , 1.01092435, 1.68587263, ..., 1.24359659, 1.75479381,
         5.73873464],
        [1.01092437, 0.        , 1.72123769, ..., 0.78892532, 1.71977378,
         5.67208311],
        [1.68587264, 1.72123769, 0.        , ..., 1.42997021, 2.20660914,
         6.53309456],
        ...,
        [1.24359658, 0.78892529, 1.42997021, ..., 0.        , 2.26671431,
         6.41402297],
        [1.7547938 , 1.71977375, 2.20660914, ..., 2.2667143 , 0.        ,
         4.6957284 ],
        [5.73873463, 5.67208312, 6.53309457, ..., 6.41402297, 4.69572838,
         0.        ]]),
 'RelMS_Minkowski_Sokal_Matching': array([[0.        , 1.0104344 , 1.68473307, ..., 1.24302039, 1.75451827,
         5.7636572 ],
        [1.01043437, 0.        , 1.72039524, ..., 0.78891568, 1.71978231,
         5.69946617],
        [1.68473308, 1.72039525, 0.        , ..., 1.42922921, 2.20651554,
         6.55109162],
        ...,
        [1.24302037, 0.7889157 , 1.4292292 , ..., 0.        , 2.2667207 ,
         6.42402052],
        [1.75451827, 1.71978229, 2.20651553, ..., 2.2667207 , 0.        ,
         4.78235997],
        [5.7636572 , 5.69946616, 6.55109161, ..., 6.42402052, 4.78235997,
         0.        ]]),
 'RelMS_Minkowski_Jaccard_Matching': array([[0.        , 1.01043437, 1.68473307, ..., 1.24302038, 1.75451828,
         5.73875343],
        [1.01043439, 0.        , 1.72039525, ..., 0.78891569, 1.71978232,
         5.67221733],
        [1.68473307, 1.72039524, 0.        , ..., 1.4292292 , 2.20651553,
         6.5336026 ],
        ...,
        [1.24302038, 0.78891568, 1.4292292 , ..., 0.        , 2.2667207 ,
         6.41417732],
        [1.75451828, 1.7197823 , 2.20651553, ..., 2.2667207 , 0.        ,
         4.6969009 ],
        [5.73875342, 5.67221732, 6.5336026 , ..., 6.41417732, 4.6969009 ,
         0.        ]]),
 'RelMS_Canberra_Sokal_Matching': array([[0.        , 3.29475825, 3.63767326, ..., 3.42002989, 3.78234978,
         4.28387746],
        [3.29475817, 0.        , 3.54627477, ..., 3.36365755, 3.64707779,
         4.11290306],
        [3.63767327, 3.5462748 , 0.        , ..., 3.36371231, 3.88636668,
         4.26421609],
        ...,
        [3.42002989, 3.36365756, 3.36371231, ..., 0.        , 4.08835735,
         4.43146723],
        [3.78234979, 3.64707779, 3.88636667, ..., 4.08835736, 0.        ,
         3.55682862],
        [4.28387745, 4.11290305, 4.26421607, ..., 4.43146723, 3.55682862,
         0.        ]]),
 'RelMS_Canberra_Jaccard_Matching': array([[0.        , 3.29475816, 3.63767325, ..., 3.42002988, 3.7823498 ,
         4.18398249],
        [3.29475818, 0.        , 3.54627479, ..., 3.36365756, 3.64707782,
         4.00084943],
        [3.63767326, 3.54627478, 0.        , ..., 3.36371229, 3.88636666,
         4.15092751],
        ...,
        [3.42002988, 3.36365755, 3.36371228, ..., 0.        , 4.08835736,
         4.3378168 ],
        [3.78234979, 3.64707778, 3.88636666, ..., 4.08835735, 0.        ,
         3.36218137],
        [4.18398248, 4.00084941, 4.15092752, ..., 4.3378168 , 3.36218137,
         0.        ]]),
 'RelMS_Pearson_Sokal_Matching': array([[0.        , 1.04250916, 1.57029271, ..., 1.11835441, 2.35030151,
         3.99961285],
        [1.04250913, 0.        , 1.55642417, ..., 0.55073019, 2.17276224,
         3.83629275],
        [1.5702927 , 1.55642418, 0.        , ..., 1.44481248, 2.11094744,
         4.05200057],
        ...,
        [1.11835439, 0.55073021, 1.44481248, ..., 0.        , 2.43447697,
         4.16544183],
        [2.35030151, 2.17276223, 2.11094745, ..., 2.43447697, 0.        ,
         3.00502738],
        [3.99961283, 3.83629274, 4.05200056, ..., 4.16544183, 3.00502738,
         0.        ]]),
 'RelMS_Pearson_Jaccard_Matching': array([[0.        , 1.04250913, 1.57029271, ..., 1.11835441, 2.35030152,
         3.89789603],
        [1.04250915, 0.        , 1.55642418, ..., 0.55073023, 2.17276226,
         3.72479069],
        [1.5702927 , 1.55642415, 0.        , ..., 1.44481247, 2.11094744,
         3.94329467],
        ...,
        [1.11835439, 0.55073016, 1.44481248, ..., 0.        , 2.43447698,
         4.07654071],
        [2.35030152, 2.17276223, 2.11094745, ..., 2.43447697, 0.        ,
         2.77842982],
        [3.89789601, 3.72479067, 3.94329467, ..., 4.0765407 , 2.77842982,
         0.        ]]),
 'RelMS_Mahalanobis_Sokal_Matching': array([[0.        , 1.0872495 , 1.91566724, ..., 1.23718333, 2.78694322,
         3.59368169],
        [1.08724948, 0.        , 1.72190382, ..., 0.49510814, 2.51013925,
         3.52430362],
        [1.91566725, 1.72190383, 0.        , ..., 1.53860587, 1.97114821,
         3.91897956],
        ...,
        [1.23718333, 0.49510818, 1.53860586, ..., 0.        , 2.47401146,
         3.7944967 ],
        [2.78694323, 2.51013924, 1.97114821, ..., 2.47401146, 0.        ,
         4.10401609],
        [3.59368167, 3.52430361, 3.91897955, ..., 3.7944967 , 4.10401609,
         0.        ]]),
 'RelMS_Mahalanobis_Jaccard_Matching': array([[0.        , 1.08724947, 1.91566724, ..., 1.23718333, 2.78694323,
         3.46907215],
        [1.0872495 , 0.        , 1.72190383, ..., 0.49510817, 2.51013926,
         3.39550188],
        [1.91566724, 1.72190381, 0.        , ..., 1.53860586, 1.97114821,
         3.80535063],
        ...,
        [1.23718333, 0.49510812, 1.53860586, ..., 0.        , 2.47401147,
         3.68911387],
        [2.78694323, 2.51013924, 1.97114821, ..., 2.47401147, 0.        ,
         3.96214705],
        [3.46907213, 3.39550187, 3.80535063, ..., 3.68911387, 3.96214705,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Sokal_Matching_trimmed': array([[0.        , 1.05396495, 1.74951184, ..., 1.15390312, 2.67058462,
         3.82780883],
        [1.05396493, 0.        , 1.63479812, ..., 0.39866731, 2.51224528,
         3.76362714],
        [1.74951185, 1.63479814, 0.        , ..., 1.49657109, 1.961588  ,
         4.09825745],
        ...,
        [1.15390311, 0.39866735, 1.49657109, ..., 0.        , 2.41854434,
         3.97375586],
        [2.67058463, 2.51224527, 1.961588  , ..., 2.41854434, 0.        ,
         4.81269468],
        [3.82780882, 3.76362713, 4.09825744, ..., 3.97375586, 4.81269468,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Sokal_Matching_winsorized': array([[0.        , 1.07688717, 1.88851059, ..., 1.21940102, 2.83800382,
         3.64003684],
        [1.07688713, 0.        , 1.70819251, ..., 0.45786842, 2.58662722,
         3.59029333],
        [1.8885106 , 1.70819253, 0.        , ..., 1.53220354, 1.99808026,
         3.97860895],
        ...,
        [1.21940101, 0.45786849, 1.53220353, ..., 0.        , 2.50787408,
         3.829693  ],
        [2.83800382, 2.58662721, 1.99808026, ..., 2.50787408, 0.        ,
         4.38739858],
        [3.64003683, 3.59029333, 3.97860894, ..., 3.829693  , 4.38739858,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Sokal_Matching_MAD': array([[0.        , 1.06915308, 1.73228661, ..., 1.15789936, 2.45834684,
         3.97049139],
        [1.06915305, 0.        , 1.61195487, ..., 0.44488227, 2.24973009,
         3.81621214],
        [1.73228661, 1.61195488, 0.        , ..., 1.4894837 , 1.90536576,
         4.00431571],
        ...,
        [1.15789934, 0.44488231, 1.4894837 , ..., 0.        , 2.30824179,
         4.04102682],
        [2.45834685, 2.24973009, 1.90536577, ..., 2.30824178, 0.        ,
         3.79967402],
        [3.97049139, 3.81621213, 4.0043157 , ..., 4.04102682, 3.79967402,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Jaccard_Matching_trimmed': array([[0.        , 1.05396492, 1.74951184, ..., 1.15390312, 2.67058463,
         3.7103996 ],
        [1.05396495, 0.        , 1.63479813, ..., 0.39866734, 2.51224529,
         3.64245313],
        [1.74951185, 1.63479812, 0.        , ..., 1.49657109, 1.961588  ,
         3.98729219],
        ...,
        [1.15390311, 0.39866728, 1.49657109, ..., 0.        , 2.41854435,
         3.87035377],
        [2.67058464, 2.51224527, 1.961588  , ..., 2.41854434, 0.        ,
         4.69932707],
        [3.71039959, 3.64245311, 3.9872922 , ..., 3.87035377, 4.69932707,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Jaccard_Matching_winsorized': array([[0.        , 1.07688714, 1.88851059, ..., 1.21940102, 2.83800383,
         3.51619033],
        [1.07688715, 0.        , 1.70819252, ..., 0.45786846, 2.58662723,
         3.46347473],
        [1.88851059, 1.70819251, 0.        , ..., 1.53220354, 1.99808026,
         3.86606614],
        ...,
        [1.219401  , 0.45786843, 1.53220353, ..., 0.        , 2.50787409,
         3.72394257],
        [2.83800382, 2.58662721, 1.99808026, ..., 2.50787408, 0.        ,
         4.25828147],
        [3.51619032, 3.46347472, 3.86606614, ..., 3.72394256, 4.25828147,
         0.        ]]),
 'RelMS_Robust_Mahalanobis_Jaccard_Matching_MAD': array([[0.        , 1.06915304, 1.73228661, ..., 1.15789935, 2.45834686,
         3.86694579],
        [1.06915307, 0.        , 1.61195488, ..., 0.4448823 , 2.24973011,
         3.7045599 ],
        [1.7322866 , 1.61195486, 0.        , ..., 1.48948369, 1.90536575,
         3.89571711],
        ...,
        [1.15789934, 0.44488225, 1.48948369, ..., 0.        , 2.30824179,
         3.9478467 ],
        [2.45834686, 2.24973009, 1.90536576, ..., 2.30824179, 0.        ,
         3.64285626],
        [3.86694578, 3.70455988, 3.8957171 , ..., 3.9478467 , 3.64285626,
         0.        ]])}
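
The matrices printed above are symmetric only up to floating-point error (for example, 1.0872495 versus 1.08724948 in the first off-diagonal pair). A minimal sketch of a sanity check for such output follows. The helper name check_distance_matrix and the tolerance are assumptions made for illustration; they are not part of PyDistances.

```python
import numpy as np

def check_distance_matrix(D, atol=1e-6):
    """Hypothetical helper: basic sanity checks for a square distance matrix."""
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        return {"square": False}
    return {
        "square": True,
        # Every observation should be at distance 0 from itself.
        "zero_diagonal": bool(np.allclose(np.diag(D), 0.0, atol=atol)),
        # d(i, j) should equal d(j, i) up to numerical tolerance.
        "symmetric": bool(np.allclose(D, D.T, atol=atol)),
        "non_negative": bool((D >= -atol).all()),
    }

# Top-left 3x3 corner of 'RelMS_Mahalanobis_Sokal_Matching' as printed above.
D_corner = np.array([
    [0.0,        1.0872495,  1.91566724],
    [1.08724948, 0.0,        1.72190382],
    [1.91566725, 1.72190383, 0.0],
])
report = check_distance_matrix(D_corner)
print(report)
```

The same function can be applied to any of the full matrices returned in the dictionary above.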

Computational Cost Testing

Here we use the entire House_Price.csv dataset (1905 rows) to measure the computation time of the new distance metrics included in PyDistances.

import pandas as pd

Data = pd.read_csv('House_Price.csv')
Data = Data.loc[:, ['latitude', 'longitude', 'price', 'size_in_m_2', 'balcony_recode', 'private_garden_recode', 'private_gym_recode', 'quality_recode', 'no_of_bathrooms', 'no_of_bedrooms']]
Data.shape
(1905, 10)
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='trimmed', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)

# Time: 1.11 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='winsorized', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)

# Time: 1.15 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='MAD', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=False)

# Time: 1.12 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='trimmed', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)

# Time: 1.58 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='winsorized', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)

# Time: 1.53 minutes.
Generalized_Gower_Distance_init = GeneralizedGowerDistance(Data=Data, p1=4, p2=3, p3=3, d1='Robust_Mahalanobis', d2='Jaccard', d3='Matching', epsilon=0.05, Method='MAD', alpha=0.1)
D, D_2 = Generalized_Gower_Distance_init.compute(Related_Metric_Scaling=True)

# Time: 1.55 minutes.
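
The text does not say how the times above were recorded. One way to take such measurements is sketched below with time.perf_counter. The helper time_call and the stand-in workload pairwise_euclidean are assumptions for illustration only; in practice the call being timed would be the compute method shown above.

```python
import time
import numpy as np

def time_call(func, *args, **kwargs):
    """Hypothetical helper: run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def pairwise_euclidean(X):
    """Stand-in workload (not PyDistances): dense Euclidean distance matrix."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))   # synthetic data, 200 rows and 4 columns
D_demo, seconds = time_call(pairwise_euclidean, X_demo)
print(f"Time: {seconds / 60:.4f} minutes")
```

To time the package itself, the stand-in would be replaced by something like time_call(Generalized_Gower_Distance_init.compute, Related_Metric_Scaling=True).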

For comparison, this is the time taken by the (simple) Gower distance on the same data.

Gower_Dist_Matrix(Data, p1=4, p2=3, p3=3)

# Time: 38 seconds.
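
The reported times can be put on a common scale with a little arithmetic. The sketch below only converts the figures quoted above to seconds and divides by the Gower time; the dictionary keys are labels chosen here for readability.

```python
# Reported times in minutes, copied from the comments above.
reported_minutes = {
    "trimmed": 1.11,
    "winsorized": 1.15,
    "MAD": 1.12,
    "trimmed + RelMS": 1.58,
    "winsorized + RelMS": 1.53,
    "MAD + RelMS": 1.55,
}
gower_seconds = 38.0  # reported time for Gower_Dist_Matrix

# Slowdown factor of each configuration relative to the simple Gower distance.
ratios = {
    name: (minutes * 60.0) / gower_seconds
    for name, minutes in reported_minutes.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the Gower time")
```

On these figures, the Generalized Gower distance takes roughly 1.75 to 1.82 times as long as the simple Gower distance without Related Metric Scaling, and roughly 2.42 to 2.49 times as long with it.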

Bibliography

Albarrán, I., P. Alonso, and A. Grané. “Profile Identification via Weighted Related Metric Scaling: An Application to Dependent Spanish Children.” Journal of the Royal Statistical Society. Series A, Statistics in Society 178, no. 3 (2015): 593–618. https://doi.org/10.1111/rssa.12084.

Cuadras, C. M., and J. Fortiana. “Chapter 25 - Visualizing Categorical Data with Related Metric Scaling.” In Visualization of Categorical Data, 365–76. Academic Press, 1998. https://doi.org/10.1016/B978-012299045-8/50028-0.

Devlin, S. J., R. Gnanadesikan, and J. R. Kettenring. “Robust Estimation and Outlier Detection with Correlation Coefficients.” Biometrika 62, no. 3 (1975): 531–45. https://doi.org/10.1093/biomet/62.3.531.

Grané, A., G. Manzi, and S. Salini. “Smart Visualization of Mixed Data.” Stats 4, no. 2 (2021): 472–85. https://doi.org/10.3390/stats4020029.

Gower, J. C. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics 27, no. 4 (1971): 857–71. https://doi.org/10.2307/2528823.

Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations. 2nd ed. New York: John Wiley and Sons, 1997.
