# ---- genieclust/R/RcppExports.R ----

# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#' @title External Cluster Validity Measures and Pairwise Partition Similarity Scores
#'
#' @description
#' The functions described in this section quantify the similarity between
#' two label vectors \code{x} and \code{y} which represent two partitions
#' of a set of \eqn{n} elements into, respectively, \eqn{K} and \eqn{L}
#' nonempty and pairwise disjoint subsets.
#'
#' For instance, \code{x} and \code{y} can represent two clusterings
#' of a dataset with \eqn{n} observations specified by two vectors
#' of labels. The functions described here can be used as external cluster
#' validity measures, where we assume that \code{x} is
#' a reference (ground-truth) partition whilst \code{y} is the vector
#' of predicted cluster memberships.
#'
#' All indices except \code{normalized_clustering_accuracy()}
#' can act as a pairwise partition similarity score: they are symmetric,
#' i.e., \code{index(x, y) == index(y, x)}.
#'
#' Each index except \code{mi_score()} (which computes the mutual
#' information score) outputs 1 given two identical partitions.
#' Note that partitions are always defined up to a permutation (bijection)
#' of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4)
#' represent the same 2-partition.
#'
#' @details
#' \code{normalized_clustering_accuracy()} (Gagolewski, 2023)
#' is an asymmetric external cluster validity measure
#' which assumes that the label vector \code{x} (or rows in the confusion
#' matrix) represents the reference (ground truth) partition.
#' It is an average proportion of correctly classified points in each cluster
#' above the worst case scenario of uniform membership assignment,
#' with cluster ID matching based on the solution to the maximal linear
#' sum assignment problem; see \code{\link{normalized_confusion_matrix}}.
#' It is given by:
#' \eqn{\max_\sigma \frac{1}{K} \sum_{j=1}^K \frac{c_{\sigma(j), j}-c_{\sigma(j),\cdot}/K}{c_{\sigma(j),\cdot}-c_{\sigma(j),\cdot}/K}},
#' where \eqn{C} is a confusion matrix with \eqn{K} rows and \eqn{L} columns,
#' \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}}, and
#' \eqn{c_{i, \cdot}=c_{i, 1}+...+c_{i, L}} is the i-th row sum,
#' under the assumption that \eqn{c_{i,j}=0} for \eqn{i>K} or \eqn{j>L}
#' and \eqn{0/0=0}.
#'
#' \code{normalized_pivoted_accuracy()} is defined as
#' \eqn{(\max_\sigma \sum_{j=1}^{\max(K,L)} c_{\sigma(j),j}/n-1/\max(K,L))/(1-1/\max(K,L))},
#' where \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}},
#' and \eqn{n} is the sum of all elements in \eqn{C}.
#' For non-square matrices, missing rows/columns are assumed
#' to be filled with 0s.
#'
#' \code{pair_sets_index()} (PSI) was introduced in (Rezaei, Franti, 2016).
#' The simplified PSI assumes E=1 in the definition of the index,
#' i.e., uses Eq. (20) in the said paper instead of Eq. (18).
#' For non-square matrices, missing rows/columns are assumed
#' to be filled with 0s.
#'
#' \code{rand_score()} gives the Rand score (the "probability" of agreement
#' between the two partitions) and
#' \code{adjusted_rand_score()} is its version corrected for chance,
#' see (Hubert, Arabie, 1985): its expected value is 0 given two independent
#' partitions. Due to the adjustment, the resulting index may be negative
#' for some inputs.
#'
#' Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score
#' and \code{adjusted_fm_score()} is its adjusted-for-chance version;
#' see (Hubert, Arabie, 1985).
#'
#' \code{mi_score()}, \code{adjusted_mi_score()} and
#' \code{normalized_mi_score()} are information-theoretic
#' scores, based on mutual information,
#' see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}}
#' in (Vinh et al., 2010).
#'
#'
#' \code{normalized_confusion_matrix()} computes the confusion matrix
#' and permutes its rows and columns so that the sum of the elements
#' of the main diagonal is the largest possible (by solving
#' the maximal assignment problem).
#' The function only accepts \eqn{K \leq L}.
#' The reordering of the columns of a confusion matrix can be determined
#' by calling \code{normalizing_permutation()}.
#'
#' Also note that the built-in
#' \code{\link{table}()} determines the standard confusion matrix.
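The pair-counting scores above are all simple functions of the confusion matrix. As an illustration only (the package computes these indices in C++; `ari_from_confusion` is a hypothetical helper name of ours), the adjusted Rand index of Hubert and Arabie (1985) can be sketched in base R as:

```r
# Adjusted Rand index from a confusion matrix (Hubert, Arabie, 1985);
# an illustrative base-R sketch, not the package's implementation.
ari_from_confusion <- function(C) {
    n <- sum(C)
    comb2 <- function(v) sum(choose(v, 2))
    a  <- comb2(as.vector(C))      # pairs co-clustered in both partitions
    rx <- comb2(rowSums(C))        # pairs co-clustered in x
    ry <- comb2(colSums(C))        # pairs co-clustered in y
    e  <- rx * ry / choose(n, 2)   # expected value of a under independence
    (a - e) / ((rx + ry) / 2 - e)
}

x <- c(1, 1, 2, 2, 3, 3)
y <- c(1, 1, 2, 3, 3, 3)
ari_from_confusion(table(x, y))
```

Identical partitions (up to a relabelling) yield 1, while the chance adjustment makes the expected value 0 under independent labellings, so negative values are possible.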
#'
#'
#' @references
#' Gagolewski M., A framework for benchmarking clustering algorithms,
#' \emph{SoftwareX} 20, 2022, 101270,
#' \doi{10.1016/j.softx.2022.101270},
#' \url{https://clustering-benchmarks.gagolewski.com}.
#'
#' Gagolewski M., Normalised clustering accuracy: An asymmetric external
#' cluster validity measure, \emph{Journal of Classification}, 2024, in press,
#' \doi{10.1007/s00357-024-09482-2}.
#'
#' Hubert L., Arabie P., Comparing partitions,
#' \emph{Journal of Classification} 2(1), 1985, 193-218, esp. Eqs. (2) and (4).
#'
#' Meila M., Heckerman D., An experimental comparison of model-based clustering
#' methods, \emph{Machine Learning} 42, 2001, pp. 9-29,
#' \doi{10.1023/A:1007648401407}.
#'
#' Rezaei M., Franti P., Set matching measures for external cluster validity,
#' \emph{IEEE Transactions on Knowledge and Data Engineering} 28(8), 2016,
#' 2173-2186.
#'
#' Steinley D., Properties of the Hubert-Arabie adjusted Rand index,
#' \emph{Psychological Methods} 9(3), 2004, pp. 386-396,
#' \doi{10.1037/1082-989X.9.3.386}.
#'
#' Vinh N.X., Epps J., Bailey J.,
#' Information theoretic measures for clusterings comparison:
#' Variants, properties, normalization and correction for chance,
#' \emph{Journal of Machine Learning Research} 11, 2010, 2837-2854.
#'
#'
#' @param x an integer vector of length n (or an object coercible to)
#' representing a K-partition of an n-set (e.g., a reference partition),
#' or a confusion matrix with K rows and L columns
#' (see \code{\link{table}(x, y)})
#'
#' @param y an integer vector of length n (or an object coercible to)
#' representing an L-partition of the same set (e.g., the output of a
#' clustering algorithm we wish to compare with \code{x}),
#' or NULL (if x is a K*L confusion matrix)
#'
#' @param simplified whether to assume E=1 in the definition of the pair sets
#' index, i.e., use Eq. (20) in (Rezaei, Franti, 2016) instead of Eq.
#' (18)
#'
#' @param clipped whether the result should be clipped to the unit interval,
#' i.e., [0, 1]
#'
#'
#' @return Each cluster validity measure is a single numeric value.
#'
#' \code{normalized_confusion_matrix()} returns a numeric matrix.
#'
#' \code{normalizing_permutation()} returns a vector of indexes.
#'
#'
#' @examples
#' y_true <- iris[[5]]
#' y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster
#' normalized_clustering_accuracy(y_true, y_pred)
#' normalized_pivoted_accuracy(y_true, y_pred)
#' pair_sets_index(y_true, y_pred)
#' pair_sets_index(y_true, y_pred, simplified=TRUE)
#' adjusted_rand_score(y_true, y_pred)
#' rand_score(table(y_true, y_pred)) # the same
#' adjusted_fm_score(y_true, y_pred)
#' fm_score(y_true, y_pred)
#' mi_score(y_true, y_pred)
#' normalized_mi_score(y_true, y_pred)
#' adjusted_mi_score(y_true, y_pred)
#' normalized_confusion_matrix(y_true, y_pred)
#' normalizing_permutation(y_true, y_pred)
#'
#' @rdname compare_partitions
#' @name compare_partitions
#' @export
normalized_clustering_accuracy <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_clustering_accuracy`, x, y)
}

#' @rdname compare_partitions
#' @export
normalized_pivoted_accuracy <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_pivoted_accuracy`, x, y)
}

#' @rdname compare_partitions
#' @export
pair_sets_index <- function(x, y = NULL, simplified = FALSE, clipped = TRUE) {
    .Call(`_genieclust_pair_sets_index`, x, y, simplified, clipped)
}

#' @rdname compare_partitions
#' @export
adjusted_rand_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_rand_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
rand_score <- function(x, y = NULL) {
    .Call(`_genieclust_rand_score`, x, y)
}

#' @rdname compare_partitions
#' @export
adjusted_fm_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_fm_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
fm_score <- function(x, y = NULL) {
    .Call(`_genieclust_fm_score`, x, y)
}

#' @rdname compare_partitions
#' @export
mi_score <- function(x, y = NULL) {
    .Call(`_genieclust_mi_score`, x, y)
}

#' @rdname compare_partitions
#' @export
normalized_mi_score <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_mi_score`, x, y)
}

#' @rdname compare_partitions
#' @export
adjusted_mi_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_mi_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
normalized_confusion_matrix <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_confusion_matrix`, x, y)
}

#' @rdname compare_partitions
#' @export
normalizing_permutation <- function(x, y = NULL) {
    .Call(`_genieclust_normalizing_permutation`, x, y)
}

#' @title Internal Cluster Validity Measures
#'
#' @description
#' Implementation of a number of so-called cluster validity indices critically
#' reviewed in (Gagolewski, Bartoszuk, Cena, 2021). See Section 2
#' therein and (Gagolewski, 2022) for the respective definitions.
#'
#' The greater the index value, the more \emph{valid} (whatever that means)
#' the assessed partition. For consistency, the Ball-Hall and
#' Davies-Bouldin indexes as well as the within-cluster sum of squares (WCSS)
#' take negative values.
#'
#'
#' @param X numeric matrix with \code{n} rows and \code{d} columns,
#' representing \code{n} points in a \code{d}-dimensional space
#'
#' @param y vector of \code{n} integer labels,
#' representing a partition whose \emph{quality} is to be
#' assessed; \code{y[i]} is the cluster ID of the \code{i}-th point,
#' \code{X[i, ]}; \code{1 <= y[i] <= K}, where \code{K} is the number
#' of clusters
#'
#' @param M number of nearest neighbours
#'
#' @param lowercase_d an integer between 1 and 5, denoting
#' \eqn{d_1}, ..., \eqn{d_5} in the definition
#' of the generalised Dunn (Bezdek-Pal) index (numerator:
#' min, max, and mean pairwise intracluster distance,
#' distance between cluster centroids,
#' weighted point-centroid distance, respectively)
#'
#' @param uppercase_d an integer between 1 and 3, denoting
#' \eqn{D_1}, ..., \eqn{D_3} in the definition
#' of the generalised Dunn (Bezdek-Pal) index (denominator:
#' max and min pairwise intracluster distance, average point-centroid
#' distance, respectively)
#'
#' @param owa_numerator,owa_denominator single string specifying
#' the OWA operators to use in the definition of the DuNN index;
#' one of: \code{"Mean"}, \code{"Min"}, \code{"Max"}, \code{"Const"},
#' \code{"SMin:D"}, \code{"SMax:D"}, where \code{D} is an integer
#' defining the degree of smoothness
#'
#'
#' @return
#' A single numeric value (the more, the \emph{better}).
#'
#' @references
#' Ball G.H., Hall D.J.,
#' \emph{ISODATA: A novel method of data analysis and pattern classification},
#' Technical report No. AD699616, Stanford Research Institute, 1965.
#'
#' Bezdek J., Pal N., Some new indexes of cluster validity,
#' \emph{IEEE Transactions on Systems, Man, and Cybernetics, Part B} 28,
#' 1998, 301-315, \doi{10.1109/3477.678624}.
#'
#' Calinski T., Harabasz J., A dendrite method for cluster analysis,
#' \emph{Communications in Statistics} 3(1), 1974, 1-27,
#' \doi{10.1080/03610927408827101}.
#'
#' Davies D.L., Bouldin D.W.,
#' A Cluster Separation Measure,
#' \emph{IEEE Transactions on Pattern Analysis and Machine Intelligence}
#' PAMI-1 (2), 1979, 224-227, \doi{10.1109/TPAMI.1979.4766909}.
#'
#' Dunn J.C., A Fuzzy Relative of the ISODATA Process and Its Use in Detecting
#' Compact Well-Separated Clusters, \emph{Journal of Cybernetics} 3(3), 1973,
#' 32-57, \doi{10.1080/01969727308546046}.
#'
#' Gagolewski M., Bartoszuk M., Cena A.,
#' Are cluster validity measures (in)valid?, \emph{Information Sciences} 581,
#' 620-636, 2021, \doi{10.1016/j.ins.2021.10.004};
#' preprint: \url{https://raw.githubusercontent.com/gagolews/bibliography/master/preprints/2021cvi.pdf}.
#'
#' Gagolewski M., A Framework for Benchmarking Clustering Algorithms,
#' \emph{SoftwareX} 20, 2022, 101270,
#' \doi{10.1016/j.softx.2022.101270},
#' \url{https://clustering-benchmarks.gagolewski.com}.
#'
#' Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and
#' Validation of Cluster Analysis,
#' \emph{Journal of Computational and Applied Mathematics} 20, 1987, 53-65,
#' \doi{10.1016/0377-0427(87)90125-7}.
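To make one of these definitions concrete, here is an illustrative base-R transcription of the Calinski-Harabasz index (the exported `calinski_harabasz_index()` is implemented in C++; `ch_index` is a hypothetical helper of ours): the between-cluster dispersion over the within-cluster dispersion, scaled by (n - K)/(K - 1).

```r
# Calinski-Harabasz index: (B / (K - 1)) / (W / (n - K)), where B is the
# between-cluster and W the within-cluster sum of squared distances.
# Illustrative sketch only, not the package's implementation.
ch_index <- function(X, y) {
    X <- as.matrix(X)
    n <- nrow(X)
    K <- length(unique(y))
    centroid <- colMeans(X)
    W <- 0; B <- 0
    for (k in unique(y)) {
        Xk <- X[y == k, , drop=FALSE]
        ck <- colMeans(Xk)
        W <- W + sum(sweep(Xk, 2, ck)^2)            # within-cluster SSQ
        B <- B + nrow(Xk) * sum((ck - centroid)^2)  # between-cluster SSQ
    }
    (B / (K - 1)) / (W / (n - K))
}
```

Larger values indicate well-separated, compact clusters, which is why the index can be maximised over candidate partitions.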
#'
#'
#'
#' @examples
#' X <- as.matrix(iris[,1:4])
#' X[,] <- jitter(X)  # otherwise we get a non-unique solution
#' y <- as.integer(iris[[5]])
#' calinski_harabasz_index(X, y)  # good
#' calinski_harabasz_index(X, sample(1:3, nrow(X), replace=TRUE))  # bad
#'
#' @name cluster_validity
#' @rdname cluster_validity
#' @export
calinski_harabasz_index <- function(X, y) {
    .Call(`_genieclust_calinski_harabasz_index`, X, y)
}

#' @rdname cluster_validity
#' @export
dunnowa_index <- function(X, y, M = 25L, owa_numerator = "SMin:5", owa_denominator = "Const") {
    .Call(`_genieclust_dunnowa_index`, X, y, M, owa_numerator, owa_denominator)
}

#' @rdname cluster_validity
#' @export
generalised_dunn_index <- function(X, y, lowercase_d, uppercase_d) {
    .Call(`_genieclust_generalised_dunn_index`, X, y, lowercase_d, uppercase_d)
}

#' @rdname cluster_validity
#' @export
negated_ball_hall_index <- function(X, y) {
    .Call(`_genieclust_negated_ball_hall_index`, X, y)
}

#' @rdname cluster_validity
#' @export
negated_davies_bouldin_index <- function(X, y) {
    .Call(`_genieclust_negated_davies_bouldin_index`, X, y)
}

#' @rdname cluster_validity
#' @export
negated_wcss_index <- function(X, y) {
    .Call(`_genieclust_negated_wcss_index`, X, y)
}

#' @rdname cluster_validity
#' @export
silhouette_index <- function(X, y) {
    .Call(`_genieclust_silhouette_index`, X, y)
}

#' @rdname cluster_validity
#' @export
silhouette_w_index <- function(X, y) {
    .Call(`_genieclust_silhouette_w_index`, X, y)
}

#' @rdname cluster_validity
#' @export
wcnn_index <- function(X, y, M = 25L) {
    .Call(`_genieclust_wcnn_index`, X, y, M)
}

.mst.default <- function(X, distance = "euclidean", M = 1L, cast_float32 = TRUE, verbose = FALSE) {
    .Call(`_genieclust_dot_mst_default`, X, distance, M, cast_float32, verbose)
}

.mst.dist <- function(d, M = 1L, verbose = FALSE) {
    .Call(`_genieclust_dot_mst_dist`, d, M, verbose)
}

.genie <- function(mst, k, gini_threshold, postprocess, detect_noise, verbose) {
    .Call(`_genieclust_dot_genie`, mst, k,
          gini_threshold, postprocess, detect_noise, verbose)
}

.gclust <- function(mst, gini_threshold, verbose) {
    .Call(`_genieclust_dot_gclust`, mst, gini_threshold, verbose)
}

#' @title Inequality Measures
#'
#' @description
#' \code{gini_index()} gives the normalised Gini index,
#' \code{bonferroni_index()} implements the Bonferroni index, and
#' \code{devergottini_index()} implements the De Vergottini index.
#'
#' @details
#' These indices can be used to quantify the "inequality" of a numeric sample.
#' They can be conceived as normalised measures of data dispersion.
#' For constant vectors (perfect equity), the indices yield values of 0.
#' Vectors with all elements but one equal to 0 (perfect inequality)
#' are assigned scores of 1.
#' They follow the Pigou-Dalton principle (are Schur-convex):
#' setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0}
#' and \eqn{x_i - h \geq x_j + h} (taking from the "rich" and
#' giving to the "poor") decreases the inequality.
#'
#' These indices have applications in economics, amongst others.
#' The Genie clustering algorithm uses the Gini index as a measure
#' of the inequality of cluster sizes.
#'
#'
#' The normalised Gini index is given by:
#' \deqn{
#'     G(x_1,\dots,x_n) = \frac{
#'     \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)}
#'     }{
#'     (n-1) \sum_{i=1}^n x_i
#'     }.
#' }
#'
#' The normalised Bonferroni index is given by:
#' \deqn{
#'     B(x_1,\dots,x_n) = \frac{
#'     \sum_{i=1}^{n} (n-\sum_{j=1}^i \frac{n}{n-j+1})
#'          x_{\sigma(n-i+1)}
#'     }{
#'     (n-1) \sum_{i=1}^n x_i
#'     }.
#' }
#'
#' The normalised De Vergottini index is given by:
#' \deqn{
#'     V(x_1,\dots,x_n) =
#'     \frac{1}{\sum_{i=2}^n \frac{1}{i}} \left(
#'        \frac{ \sum_{i=1}^n \left( \sum_{j=i}^{n} \frac{1}{j}\right)
#'        x_{\sigma(n-i+1)} }{\sum_{i=1}^{n} x_i} - 1
#'     \right).
#' }
#'
#' Here, \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}.
#'
#' Time complexity: \eqn{O(n)} for sorted (increasingly) data.
#' Otherwise, the vector will be sorted.
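The Gini formula above translates almost verbatim into base R; the following sketch (our own illustrative `gini_r`, not the package's compiled O(n) routine) may help in reading the \deqn{} blocks:

```r
# Normalised Gini index, transcribed directly from the formula above;
# sorting decreasingly gives x[i] = x_{sigma(n-i+1)}, the i-th largest value.
# Illustrative sketch only.
gini_r <- function(x) {
    n <- length(x)
    x <- sort(x, decreasing=TRUE)
    sum((n - 2 * seq_len(n) + 1) * x) / ((n - 1) * sum(x))
}

gini_r(c(2, 2, 2, 2, 2))   # perfect equality: 0
gini_r(c(0, 0, 10, 0, 0))  # perfect inequality: 1
```

Note how a Pigou-Dalton transfer decreases the value: `gini_r(c(7, 0, 3, 0, 0))` exceeds `gini_r(c(6, 0, 3, 1, 0))`.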
#'
#'
#' @references
#' Bonferroni C., \emph{Elementi di Statistica Generale}, Libreria Seber,
#' Firenze, 1930.
#'
#' Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and
#' outlier-resistant hierarchical clustering algorithm,
#' \emph{Information Sciences} 363, 2016, pp. 8-23,
#' \doi{10.1016/j.ins.2016.05.003}.
#'
#' Gini C., \emph{Variabilita e Mutabilita},
#' Tipografia di Paolo Cuppini, Bologna, 1912.
#'
#'
#' @param x numeric vector of non-negative values
#'
#' @return The value of the inequality index, a number in \eqn{[0, 1]}.
#'
#' @examples
#' gini_index(c(2, 2, 2, 2, 2))   # no inequality
#' gini_index(c(0, 0, 10, 0, 0))  # one has it all
#' gini_index(c(7, 0, 3, 0, 0))   # give to the poor, take away from the rich
#' gini_index(c(6, 0, 3, 1, 0))   # (a.k.a. Pigou-Dalton principle)
#' bonferroni_index(c(2, 2, 2, 2, 2))
#' bonferroni_index(c(0, 0, 10, 0, 0))
#' bonferroni_index(c(7, 0, 3, 0, 0))
#' bonferroni_index(c(6, 0, 3, 1, 0))
#' devergottini_index(c(2, 2, 2, 2, 2))
#' devergottini_index(c(0, 0, 10, 0, 0))
#' devergottini_index(c(7, 0, 3, 0, 0))
#' devergottini_index(c(6, 0, 3, 1, 0))
#'
#' @name inequality
#' @rdname inequality
#' @export
gini_index <- function(x) {
    .Call(`_genieclust_gini_index`, x)
}

#' @rdname inequality
#' @export
bonferroni_index <- function(x) {
    .Call(`_genieclust_bonferroni_index`, x)
}

#' @rdname inequality
#' @export
devergottini_index <- function(x) {
    .Call(`_genieclust_devergottini_index`, x)
}


# ---- genieclust/R/genieclust-package.R ----

# This file is part of the genieclust package for R.
# ############################################################################
#
# Copyleft (C) 2020-2024, Marek Gagolewski
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License
# Version 3, 19 November 2007, published by the Free Software Foundation.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License Version 3 for more details.
# You should have received a copy of the License along with this program.
# If this is not the case, refer to .
#
# ############################################################################


#' @title The Genie Hierarchical Clustering Algorithm (with Extras)
#'
#' @description
#' See \code{\link{genie}()} for more details.
#'
#' @useDynLib genieclust, .registration=TRUE
#' @importFrom Rcpp evalCpp
#' @importFrom stats hclust
#' @importFrom stats cutree
#' @importFrom stats dist
#' @importFrom utils capture.output
#' @keywords internal
"_PACKAGE"


# ---- genieclust/R/gclust.R ----

# This file is part of the genieclust package for R.


.correct_height <- function(height)
{
    # correction for the departure from ultrametricity
    # TODO: allow one choose?
    # cumsum(height)
    rev(cummin(rev(height)))
}


#' @title Hierarchical Clustering Algorithm Genie
#'
#' @description
#' A reimplementation of \emph{Genie} - a robust and outlier-resistant
#' clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016).
#' The Genie algorithm is based on a minimum spanning tree (MST) of the
#' pairwise distance graph of a given point set.
#' Just like the single linkage, it consumes the edges
#' of the MST in an increasing order of weights. However, it prevents
#' the formation of clusters of highly imbalanced sizes; once the Gini index
#' (see \code{\link{gini_index}()}) of the cluster size distribution
#' rises above \code{gini_threshold}, a forced merge of a point group
#' of the smallest size is performed. Its appealing simplicity goes hand
#' in hand with its usability; Genie often outperforms
#' other clustering approaches on benchmark data,
#' such as \url{https://github.com/gagolews/clustering-benchmarks}.
#'
#' The clustering can now also be computed with respect to the
#' mutual reachability distance (based, e.g., on the Euclidean metric),
#' which is used in the definition of the HDBSCAN* algorithm
#' (see Campello et al., 2013). If \code{M} > 1, then the mutual reachability
#' distance \eqn{m(i,j)} with smoothing factor \code{M} is used instead of the
#' chosen "raw" distance \eqn{d(i,j)}. It holds \eqn{m(i,j)=\max(d(i,j), c(i), c(j))},
#' where \eqn{c(i)} is \eqn{d(i,k)} with \eqn{k} being the
#' (\code{M}-1)-th nearest neighbour of \eqn{i}.
#' This makes "noise" and "boundary" points get "pulled away" from each other.
#'
#' The Genie correction together with the smoothing factor \code{M} > 1 (note that
#' \code{M} = 2 corresponds to the original distance) gives a robustified version of
#' the HDBSCAN* algorithm that is able to detect a predefined number of
#' clusters. Hence it does not depend on DBSCAN's somewhat magical
#' \code{eps} parameter or HDBSCAN's \code{min_cluster_size} one.
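The mutual reachability distance is easy to tabulate explicitly for a small dataset; the sketch below (an illustrative helper of ours - the package computes it internally) follows the definition m(i,j) = max(d(i,j), c(i), c(j)):

```r
# Mutual reachability distance with smoothing factor M; c(i) is the
# distance from point i to its (M-1)-th nearest neighbour.
# Illustrative only - not part of the package API.
mutual_reachability <- function(D, M = 2L) {
    D <- as.matrix(D)
    # sort(D[i, ])[M]: position 1 is the zero self-distance, so position M
    # holds the (M-1)-th nearest-neighbour ("core") distance
    core <- apply(D, 1, function(d_i) sort(d_i)[M])
    m <- pmax(D, outer(core, core, pmax))  # max(d(i,j), c(i), c(j))
    diag(m) <- 0
    m
}

D <- as.matrix(dist(c(0, 1, 10)))  # three points on the real line
mutual_reachability(D, M = 2L)     # M = 2 reproduces D itself
mutual_reachability(D, M = 3L)
```

For M = 3 the distance between the two close points jumps from 1 to 10 - exactly the "pulling away" of sparse regions described above.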
#'
#'
#' @details
#' Note that, as in the case of all distance-based methods,
#' the standardisation of the input features is definitely worth giving a try.
#'
#' If \code{d} is a numeric matrix or an object of class \code{dist},
#' \code{\link{mst}()} will be called to compute an MST, which generally
#' takes at most \eqn{O(n^2)} time (the algorithm we provide is parallelised,
#' environment variable \code{OMP_NUM_THREADS} controls the number of threads
#' in use). However, see \code{\link{emst_mlpack}()} for a very fast alternative
#' in the case of Euclidean spaces of (very) low dimensionality and \code{M} = 1.
#'
#' Given a minimum spanning tree, the algorithm runs in \eqn{O(n \sqrt{n})} time.
#' Therefore, if you want to test different \code{gini_threshold}s
#' (or \code{k}s), it is best to explicitly compute the MST first.
#'
#' According to the algorithm's original definition,
#' the resulting partition tree (dendrogram) might violate
#' the ultrametricity property (merges might occur at levels that
#' are not increasing w.r.t. a between-cluster distance).
#' \code{gclust()} automatically corrects departures from
#' ultrametricity by applying \code{height = rev(cummin(rev(height)))}.
#'
#'
#' @param d a numeric matrix (or an object coercible to one,
#' e.g., a data frame with numeric-like columns) or an
#' object of class \code{dist}, see \code{\link[stats]{dist}},
#' or an object of class \code{mst}, see \code{\link{mst}()}.
#'
#' @param gini_threshold threshold for the Genie correction, i.e.,
#' the Gini index of the cluster size distribution;
#' a threshold of 1.0 disables the correction.
#' Low thresholds highly penalise the formation of small clusters.
#'
#' @param distance metric used to compute the linkage, one of:
#' \code{"euclidean"} (synonym: \code{"l2"}),
#' \code{"manhattan"} (a.k.a. \code{"l1"} and \code{"cityblock"}),
#' \code{"cosine"}.
#'
#' @param verbose logical; whether to print diagnostic messages
#' and progress information.
#'
#' @param cast_float32 logical; whether to compute the distances using 32-bit
#' instead of 64-bit precision floating-point arithmetic (up to 2x faster).
#'
#' @param ... further arguments passed to other methods.
#'
#' @param k the desired number of clusters to detect; \code{k} = 1 with
#' \code{M} > 1 acts as a noise point detector.
#'
#' @param detect_noise whether the minimum spanning tree's leaves
#' should be marked as noise points; defaults to \code{TRUE} if \code{M} > 1
#' for compatibility with HDBSCAN*.
#'
#' @param M smoothing factor; \code{M} <= 2 gives the selected \code{distance};
#' otherwise, the mutual reachability distance is used.
#'
#' @param postprocess one of \code{"boundary"} (default), \code{"none"}
#' or \code{"all"}; in effect only if \code{M} > 1.
#' By default, only "boundary" points are merged
#' with their nearest "core" points (a point is a boundary point if it is
#' a noise point and it is amongst its adjacent vertex's
#' \code{M}-1 nearest neighbours). To force a classical
#' k-partition of a data set (with no notion of noise),
#' choose "all".
#'
#'
#' @return
#' \code{gclust()} computes the whole clustering hierarchy; it
#' returns a list of class \code{hclust},
#' see \code{\link[stats]{hclust}}. Use \code{\link[stats]{cutree}} to obtain
#' an arbitrary k-partition.
#'
#' \code{genie()} returns a \code{k}-partition - a vector with elements in 1,...,k,
#' whose i-th element denotes the i-th input point's cluster identifier.
#' Missing values (\code{NA}) denote noise points (if \code{detect_noise}
#' is \code{TRUE}).
#'
#' @seealso
#' \code{\link{mst}()} for the minimum spanning tree routines.
#'
#' \code{\link{adjusted_rand_score}()} (amongst others) for external
#' cluster validity measures (partition similarity scores).
#'
#'
#' @references
#' Gagolewski M., Bartoszuk M., Cena A.,
#' Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm,
#' \emph{Information Sciences} 363, 2016, 8-23,
#' \doi{10.1016/j.ins.2016.05.003}.
#'
#' Campello R.J.G.B., Moulavi D., Sander J.,
#' Density-based clustering based on hierarchical density estimates,
#' \emph{Lecture Notes in Computer Science} 7819, 2013, 160-172,
#' \doi{10.1007/978-3-642-37456-2_14}.
#'
#' Gagolewski M., Cena A., Bartoszuk M., Brzozowski L.,
#' Clustering with minimum spanning trees: How good can it be?,
#' \emph{Journal of Classification}, 2024, in press,
#' \doi{10.1007/s00357-024-09483-1}.
#'
#'
#' @examples
#' library("datasets")
#' data("iris")
#' X <- iris[1:4]
#' h <- gclust(X)
#' y_pred <- cutree(h, 3)
#' y_test <- iris[,5]
#' plot(iris[,2], iris[,3], col=y_pred,
#'    pch=as.integer(iris[,5]), asp=1, las=1)
#' adjusted_rand_score(y_test, y_pred)
#' pair_sets_index(y_test, y_pred)
#'
#' # Fast for low-dimensional Euclidean spaces:
#' # h <- gclust(emst_mlpack(X))
#'
#' @rdname gclust
#' @export
gclust <- function(d, ...)
{
    UseMethod("gclust")
}

#' @export
#' @rdname gclust
#' @method gclust default
gclust.default <- function(d, gini_threshold=0.3,
    distance=c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
    cast_float32=TRUE, verbose=FALSE, ...)
{
    stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0)
    distance <- match.arg(distance)
    d <- as.matrix(d)
    gclust.mst(
        mst.default(d, M=1L, distance=distance,
            verbose=verbose, cast_float32=cast_float32),
        gini_threshold=gini_threshold, verbose=verbose)
}

#' @export
#' @rdname gclust
#' @method gclust dist
gclust.dist <- function(d, gini_threshold=0.3, verbose=FALSE, ...)
{
    stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0)
    gclust.mst(mst.dist(d, M=1L, verbose=verbose),
        gini_threshold=gini_threshold, verbose=verbose)
}

#' @export
#' @rdname gclust
#' @method gclust mst
gclust.mst <- function(d, gini_threshold=0.3, verbose=FALSE, ...)
{
    stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0)
    result <- .gclust(d, gini_threshold, verbose)
    result[["height"]] <- .correct_height(result[["height"]])
    result[["labels"]] <- attr(d, "Labels")
    class(result) <- "hclust"
    result
}

#' @rdname gclust
#' @export
genie <- function(d, ...)
{
    UseMethod("genie")
}

#' @export
#' @rdname gclust
#' @method genie default
genie.default <- function(d, k, gini_threshold=0.3,
    distance=c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"),
    M=1L, postprocess=c("boundary", "none", "all"), detect_noise=M>1L,
    cast_float32=TRUE, verbose=FALSE, ...)
{
    stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0)
    postprocess <- match.arg(postprocess)
    distance <- match.arg(distance)
    d <- as.matrix(d)
    genie.mst(
        mst.default(d, M=M, distance=distance,
            verbose=verbose, cast_float32=cast_float32),
        k=k, gini_threshold=gini_threshold, postprocess=postprocess,
        detect_noise=detect_noise, verbose=verbose)
}

#' @export
#' @rdname gclust
#' @method genie dist
genie.dist <- function(d, k, gini_threshold=0.3, M=1L,
    postprocess=c("boundary", "none", "all"), detect_noise=M>1L,
    verbose=FALSE, ...)
{
    stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0)
    postprocess <- match.arg(postprocess)
    genie.mst(mst.dist(d, M=M, verbose=verbose),
        k=k, gini_threshold=gini_threshold, postprocess=postprocess,
        detect_noise=detect_noise, verbose=verbose)
}

#' @export
#' @rdname gclust
#' @method genie mst
genie.mst <- function(d, k, gini_threshold=0.3,
    postprocess=c("boundary", "none", "all"), detect_noise=FALSE,
    verbose=FALSE, ...)
{
    stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0)
    postprocess <- match.arg(postprocess)
    structure(
        .genie(d, k, gini_threshold, postprocess, detect_noise, verbose),
        names=attr(d, "Labels")
    )
}

registerS3method("gclust", "default", "gclust.default")
registerS3method("gclust", "dist", "gclust.dist")
registerS3method("gclust", "mst", "gclust.mst")
registerS3method("genie", "default", "genie.default")
registerS3method("genie", "dist", "genie.dist")
registerS3method("genie", "mst", "genie.mst")


# ---- genieclust/R/mst.R ----

# This file is part of the genieclust package for R.
# ############################################################################ # # # # Copyleft (C) 2020-2024, Marek Gagolewski # # # # # # This program is free software: you can redistribute it and/or modify # # it under the terms of the GNU Affero General Public License # # Version 3, 19 November 2007, published by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # # but WITHOUT ANY WARRANTY; without even the implied warranty of # # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # # GNU Affero General Public License Version 3 for more details. # # You should have received a copy of the License along with this program. # # If this is not the case, refer to <https://www.gnu.org/licenses/>. # # # # ############################################################################ # #' @title Minimum Spanning Tree of the Pairwise Distance Graph #' #' @description #' A parallelised implementation of a Jarnik (Prim/Dijkstra)-like #' algorithm for determining #' a(*) minimum spanning tree (MST) of a complete undirected graph #' representing a set of n points #' with weights given by a pairwise distance matrix. #' #' (*) Note that there might be multiple minimum trees spanning a given graph. #' #' @details #' If \code{d} is a numeric matrix of size \eqn{n \times p}, #' the \eqn{n (n-1)/2} distances are computed on the fly, so that \eqn{O(n M)} #' memory is used. #' #' #' The algorithm is parallelised; set the \code{OMP_NUM_THREADS} environment #' variable (see \code{\link[base]{Sys.setenv}}) to control the number of threads #' used. #' #' Time complexity is \eqn{O(n^2)} for the method accepting an object of #' class \code{dist} and \eqn{O(p n^2)} otherwise. #' #' If \code{M} >= 2, then the mutual reachability distance \eqn{m(i,j)} #' with smoothing factor \code{M} (see Campello et al. 2013) #' is used instead of the chosen "raw" distance \eqn{d(i,j)}.
#' It holds \eqn{m(i, j)=\max(d(i,j), c(i), c(j))}, where \eqn{c(i)} is #' \eqn{d(i, k)} with \eqn{k} being the (\code{M}-1)-th nearest neighbour of \eqn{i}. #' This makes "noise" and "boundary" points be "pulled away" from each other. #' The Genie++ clustering algorithm (see \code{\link{gclust}}) #' with respect to the mutual reachability distance gains the ability to #' identify some observations as noise points. #' #' Note that the case \code{M} = 2 corresponds to the original distance, but we #' determine the 1-nearest neighbours separately as well, which is a bit #' suboptimal; you can file a feature request if this makes your data analysis #' tasks too slow. #' #' #' @seealso #' \code{\link{emst_mlpack}()} for a very fast alternative #' in case of (very) low-dimensional Euclidean spaces (and \code{M} = 1). #' #' #' @references #' Jarnik V., O jistem problemu minimalnim, #' \emph{Prace Moravske Prirodovedecke Spolecnosti} 6, 1930, 57-63. #' #' Olson C.F., Parallel algorithms for hierarchical clustering, #' \emph{Parallel Comput.} 21, 1995, 1313-1325. #' #' Prim R., Shortest connection networks and some generalisations, #' \emph{Bell Syst. Tech. J.} 36, 1957, 1389-1401. #' #' Campello R.J.G.B., Moulavi D., Sander J., #' Density-based clustering based on hierarchical density estimates, #' \emph{Lecture Notes in Computer Science} 7819, 2013, 160-172, #' \doi{10.1007/978-3-642-37456-2_14}. #' #' #' @param d either a numeric matrix (or an object coercible to one, #' e.g., a data frame with numeric-like columns) or an #' object of class \code{dist}, see \code{\link[stats]{dist}} #' #' @param distance metric used to compute the linkage, one of: #' \code{"euclidean"} (synonym: \code{"l2"}), #' \code{"manhattan"} (a.k.a.
\code{"l1"} and \code{"cityblock"}), #' \code{"cosine"} #' #' @param M smoothing factor; \code{M} = 1 gives the selected \code{distance}; #' otherwise, the mutual reachability distance is used #' #' @param verbose logical; whether to print diagnostic messages #' and progress information #' #' @param cast_float32 logical; whether to compute the distances using 32-bit #' instead of 64-bit precision floating-point arithmetic (up to 2x faster) #' #' @param ... further arguments passed to or from other methods #' #' #' @return #' Matrix of class \code{mst} with n-1 rows and 3 columns: #' \code{from}, \code{to} and \code{dist}. It holds \code{from} < \code{to}. #' Moreover, \code{dist} is sorted nondecreasingly. #' The i-th row gives the i-th edge of the MST. #' \code{(from[i], to[i])} defines the vertices (in 1,...,n) #' and \code{dist[i]} gives the weight, i.e., the #' distance between the corresponding points. #' #' The \code{method} attribute gives the name of the distance used. #' The \code{Labels} attribute gives the labels of all the input points. #' #' If \code{M} > 1, the \code{nn} attribute gives the indices of the \code{M}-1 #' nearest neighbours of each point. #' #' #' @examples #' library("datasets") #' data("iris") #' X <- iris[1:4] #' tree <- mst(X) #' #' @rdname mst #' @export mst <- function(d, ...) { UseMethod("mst") } #' @export #' @rdname mst #' @method mst default mst.default <- function(d, distance=c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), M=1L, cast_float32=TRUE, verbose=FALSE, ...) { distance <- match.arg(distance) d <- as.matrix(d) result <- .mst.default(d, distance, M, cast_float32, verbose) attr(result, "method") <- if (M == 1L) distance else sprintf("mutual reachability distance (%s, M=%d)", distance, M) attr(result, "Labels") <- dimnames(d)[[1]] class(result) <- "mst" result } #' @export #' @rdname mst #' @method mst dist mst.dist <- function(d, M=1L, verbose=FALSE, ...) 
{ result <- .mst.dist(d, M, verbose) attr(result, "method") <- if (M == 1L) attr(d, "method") else sprintf("mutual reachability distance (%s, M=%d)", attr(d, "method"), M) attr(result, "Labels") <- attr(d, "Labels") class(result) <- "mst" result } registerS3method("mst", "default", "mst.default") registerS3method("mst", "dist", "mst.dist") #' @title Euclidean Minimum Spanning Tree #' #' @description #' Provides access to the implementation of the Dual-Tree Boruvka #' algorithm from the \code{mlpack} package (if available). #' It is based on kd-trees and is fast for (very) low-dimensional #' Euclidean spaces. For higher dimensional spaces (say, over 5 features) #' or other metrics, use the parallelised Prim-like algorithm implemented #' in \code{\link{mst}()}. #' #' #' @param X a numeric matrix (or an object coercible to one, #' e.g., a data frame with numeric-like columns) #' #' @param leaf_size size of leaves in the kd-tree, #' controls the trade-off between speed and memory consumption #' #' @param naive logical; whether to use the naive, quadratic-time algorithm #' #' @param verbose logical; whether to print diagnostic messages #' #' #' @return #' An object of class \code{mst}, see \code{\link{mst}()} for details. #' #' @references #' March W.B., Ram P., Gray A.G., #' Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications, #' \emph{Proc. ACM SIGKDD'10}, 2010, 603-611, #' \url{https://mlpack.org/papers/emst.pdf}. #' #' Curtin R.R., Edel M., Lozhnikov M., Mentekidis Y., Ghaisas S., Zhang S., #' mlpack 3: A fast, flexible machine learning library, #' \emph{Journal of Open Source Software} 3(26), 2018, 726. #' #' @export emst_mlpack <- function(X, leaf_size=1, naive=FALSE, verbose=FALSE) { X <- as.matrix(X) if (!requireNamespace("mlpack", quietly=TRUE)) { warning("Package `mlpack` is not installed. 
Using mst() instead.") return(mst.default(X, verbose=verbose, cast_float32=FALSE)) } mst <- mlpack::emst(X, leaf_size=leaf_size, naive=naive, verbose=verbose)$output mst[, 1] <- mst[, 1] + 1 # 0-based -> 1-based indexing mst[, 2] <- mst[, 2] + 1 # 0-based -> 1-based indexing stopifnot(mst[, 1] < mst[, 2]) stopifnot(!is.unsorted(mst[, 3])) structure( mst, class="mst", method="euclidean", Labels=dimnames(X)[[1]] ) } genieclust/NEWS0000644000176200001440000001235514661566502013134 0ustar liggesusers# Changelog ## 1.1.6 (2024-08-22) * [PYTHON] The package now works with *numpy* 2.0. ## 1.1.5 (2023-10-18) * [BACKWARD INCOMPATIBILITY] [Python and R] Inequality measures are no longer referred to as inequity measures. * [BACKWARD INCOMPATIBILITY] [Python and R] Some external cluster validity measures were renamed: `adjusted_asymmetric_accuracy` -> `normalized_clustering_accuracy`, `normalized_accuracy` -> `normalized_pivoted_accuracy`. * [BACKWARD INCOMPATIBILITY] [Python] `compare_partitions2` has been removed, as `compare_partitions` and other partition similarity scores now support both pairs of label vectors `(x, y)` and confusion matrices `(x=C, y=None)`. * [Python and R] New parameter to `pair_sets_index`: `clipped`. * In `normalizing_permutation` and external cluster validity measures, the input matrices can now be of the type `double`. * [BUGFIX] [Python] #80: Fixed adjustment for `nmslib_n_neighbors` in small samples. * [BUGFIX] [Python] #82: `cluster_validity` submodule not imported. * [BUGFIX] Some external cluster validity measures now handle NaNs better and are slightly less prone to round-off errors. ## 1.1.4 (2023-03-31) * [Python] The GIc algorithm is no longer marked as experimental; its description is provided in . ## 1.1.3 (2023-01-17) * [R] `mst.default` now throws an error if any element in the input matrix is missing/infinite. * [Python] The call to `mlpack.emst` that stopped working with the new version of `mlpack` has been fixed. 
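For reference, the HDBSCAN*-like mutual reachability distance used throughout the package (see the `mst()` documentation above) is `m(i, j) = max(d(i, j), c(i), c(j))`, with `c(i)` the distance from point `i` to its (`M`-1)-th nearest neighbour. A small, quadratic-time Python sketch of that formula (illustrative only; not the package's parallelised implementation):

```python
import math

def mutual_reachability(points, M=2):
    """Mutual reachability distance matrix (Campello et al., 2013) sketch."""
    n = len(points)
    # plain Euclidean distances
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # c(i): distance from i to its (M-1)-th nearest neighbour (i excluded)
    core = [sorted(d[i][j] for j in range(n) if j != i)[M - 2]
            for i in range(n)]
    # m(i, j) = max(d(i, j), c(i), c(j)); zero on the diagonal
    return [[max(d[i][j], core[i], core[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]
```

The result is symmetric and never smaller than the raw distance, which is what "pulls away" noise and boundary points from the rest.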
## 1.1.2 (2022-09-17) * [Python and R] `adjusted_asymmetric_accuracy` now accepts confusion matrices with fewer columns than rows. Such "missing" columns are now treated as if they were filled with 0s. * [Python and R] `pair_sets_index` and `normalized_accuracy` now return the same results for non-symmetric confusion matrices and transposes thereof. ## 1.1.1 (2022-09-15) * [Python] #75: `nmslib` is now optional. * [BUILD TIME]: The use of `ssize_t` was not portable. ## 1.1.0 (2022-09-05) * [Python and R] New function: `adjusted_asymmetric_accuracy`. * [Python and R] Implementations of the so-called internal cluster validity measures discussed in DOI: [10.1016/j.ins.2021.10.004](https://doi.org/10.1016/j.ins.2021.10.004); see our (GitHub-only) [CVI](https://github.com/gagolews/optim_cvi) package for R. In particular, the generalised Dunn indices are based on the code originally authored by Maciej Bartoszuk. Thanks. Functions added (`cluster_validity` module): `calinski_harabasz_index`, `dunnowa_index`, `generalised_dunn_index`, `negated_ball_hall_index`, `negated_davies_bouldin_index`, `negated_wcss_index`, `silhouette_index`, `silhouette_w_index`, `wcnn_index`. These cluster validity measures are discussed in more detail at . * [BACKWARD INCOMPATIBILITY] `normalized_confusion_matrix` now solves the maximal assignment problem instead of applying the somewhat primitive partial pivoting. * [Python and R] New function: `normalizing_permutation`. * [R] New function: `normalized_confusion_matrix`. * [Python and R] New parameter to `pair_sets_index`: `simplified`. * [Python] New parameters to `plots.plot_scatter`: `axis`, `title`, `xlabel`, `ylabel`, `xlim`, `ylim`. ## 1.0.1 (2022-08-08) * [GENERAL] A paper on the `genieclust` package is now available: M. Gagolewski, genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 100722, 2021, DOI: [10.1016/j.softx.2021.100722](https://doi.org/10.1016/j.softx.2021.100722).
* [Python] `plots.plot_scatter` now uses a more accessible default palette (from R 4.0.0). * [Python and R] New function: `devergottini_index`. ## 1.0.0 (2021-04-22) * [R] Use `mlpack` instead of `RcppMLPACK` (#72). This package is merely suggested, not dependent upon. ## 0.9.8 (2021-01-08) * [Python] Require Python >= 3.7 (implied by `numpy`). * [Python] Require `nmslib`. * [R] Use `RcppMLPACK` directly; remove dependency on `emstreeR`. * [R] Use `tinytest` for unit testing instead of `testthat`. ## 0.9.4 (2020-07-31) * [BUGFIX] [R] Fix build errors on Solaris. ## 0.9.3 (2020-07-25) * [BUGFIX] [Python] Add code coverage CI. Fix some minor inconsistencies. Automate the `bdist` build chain. * [R] Update DESCRIPTION to meet the CRAN policies. ## 0.9.2 (2020-07-22) * [BUGFIX] [Python] Fix broken build script for OS X with no OpenMP. ## 0.9.1 (2020-07-18) * [GENERAL] The package has been completely rewritten. The core functionality is now implemented in C++ (with OpenMP). * [GENERAL] Clustering with respect to HDBSCAN*-like mutual reachability distances is supported. * [GENERAL] The parallelised Jarnik-Prim algorithm now supports on-the-fly distance computations. Euclidean minimum spanning tree can be determined with `mlpack`, which is much faster in low-dimensional spaces. * [R] R version is now available. * [Python] [Experimental] The GIc algorithm proposed by Anna Cena in her 2018 PhD thesis is added. * [Python] Approximate version based on nearest neighbour graphs produced by `nmslib` is added. ## 0.1a2 (2018-05-23) * [Python] Initial PyPI release. genieclust/src/0000755000176200001440000000000014661572147013220 5ustar liggesusersgenieclust/src/cvi.h0000644000176200001440000002647014634041051014144 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. 
* * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_H #define __CVI_H #include #include #include #include #include #include "c_common.h" #include "c_matrix.h" template inline T square(T x) { return x*x; } /** Computes the squared Euclidean distance between two vectors. * * @param x c_contiguous vector of length d * @param y c_contiguous vector of length d * @param d length of both x and y * @return sum((x-y)^2) */ FLOAT_T distance_l2_squared(const FLOAT_T* x, const FLOAT_T* y, size_t d) { FLOAT_T ret = 0.0; for (size_t i=0; id < other.d; } }; /** Computes Euclidean distances between pairs of points in the same dataset. * Results might be precomputed for smaller datasets. 
*/ class EuclideanDistance { private: const CMatrix* X; std::vector D; bool precomputed; bool squared; size_t n; size_t d; public: EuclideanDistance(const CMatrix* _X, bool _precompute=false, bool _square=false) : X(_X), D(_precompute?(_X->nrow()*(_X->nrow()-1)/2):0), precomputed(_precompute), squared(_square), n(_X->nrow()), d(_X->ncol()) { if (!_precompute) return; size_t k = 0; for (size_t i=0; irow(i), _X->row(j), d); } } if (!_square) { for (k=0; k j) std::swap(i, j); //GENIECLUST_ASSERT(i*n - i*(i+1)/2+(j-i-1) >= 0); //GENIECLUST_ASSERT(i*n - i*(i+1)/2+(j-i-1) < D.size()); return D[i*n - i*(i+1)/2 + (j-i-1)]; } else { if (squared) return distance_l2_squared(X->row(i), X->row(j), X->ncol()); else return sqrt(distance_l2_squared(X->row(i), X->row(j), X->ncol())); } } }; /** Base class for all the internal cluster validity indices implemented. */ class ClusterValidityIndex { protected: CMatrix X; ///< data matrix of size n*d std::vector L; ///< current label vector of size n std::vector count; ///< size of each of the K clusters const size_t K; ///< number of clusters, max(L) const size_t n; ///< number of points (for brevity of notation) const size_t d; ///< dataset dimensionality (for brevity) const bool allow_undo; ///< is the object's state preserved on modify()? Py_ssize_t last_i; ///< for undo() Py_ssize_t last_j; ///< for undo() public: /** Constructor * * @param _X dataset * @param _K number of clusters * @param _allow_undo shall the object's state be preserved on a call to * modify()? 
*/ ClusterValidityIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo ) : X(_X), L(_X.nrow()), count(_K), K(_K), n(_X.nrow()), d(_X.ncol()), allow_undo(_allow_undo) { } /** Destructor */ virtual ~ClusterValidityIndex() { } /** Returns the number of elements in the j-th cluster * * @param j * @return */ size_t get_count(const size_t j) const { GENIECLUST_ASSERT(j >= 0 && j < K); return count[j]; } /** Returns the i-th point's cluster label * * @param i * @return */ Py_ssize_t get_label(const size_t i) const { GENIECLUST_ASSERT(i >= 0 && i < n); return L[i]; } /** Returns the label vector * * @return */ const std::vector& get_labels() const { return L; } /** Returns the number of clusters * * @return */ const size_t get_K() const { return K; } /** Returns the number of data points * * @return */ const size_t get_n() const { return n; } /** Assigns a new label vector * * @param _L */ virtual void set_labels(const std::vector& _L) { GENIECLUST_ASSERT(X.nrow() == _L.size()); for (size_t j=0; j= 0 && _L[i] < (Py_ssize_t)K); L[i] = _L[i]; count[ L[i] ]++; } for (size_t j=0; j 0); } } /** Makes the i-th point a member of the j-th cluster * * The inheriting classes can overload this method and * compute the cluster validity index incrementally. * * @param i * @param j */ virtual void modify(size_t i, Py_ssize_t j) { GENIECLUST_ASSERT(i >= 0 && i < n); GENIECLUST_ASSERT(j >= 0 && j < (Py_ssize_t)K); GENIECLUST_ASSERT(L[i] >= 0 && L[i] < (Py_ssize_t)K); GENIECLUST_ASSERT(count[L[i]] > 0); GENIECLUST_ASSERT(L[i] != j); if (allow_undo) { last_i = i; last_j = L[i]; } count[L[i]]--; L[i] = j; count[L[i]]++; } /** Computes the cluster validity index for the current label vector, L */ virtual FLOAT_T compute() = 0; /** Cancels the most recent modify() operation. 
*/ virtual void undo() { GENIECLUST_ASSERT(allow_undo); count[L[last_i]]--; L[last_i] = last_j; count[L[last_i]]++; } }; /** Represents a cluster validity index that is based * on the notion of the clusters' centroid. */ class CentroidsBasedIndex : public ClusterValidityIndex { protected: CMatrix centroids; ///< centroids of all the clusters, size K*d public: // Described in the base class CentroidsBasedIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo) : ClusterValidityIndex(_X, _K, _allow_undo), centroids(K, d) { ; } // Described in the base class virtual void set_labels(const std::vector& _L) { ClusterValidityIndex::set_labels(_L); // sets L and count for (size_t i=0; i=1. */ class NNBasedIndex : public ClusterValidityIndex { protected: const size_t M; ///< number of nearest neighbours CMatrix dist; ///< dist(i, j) is the L2 distance between i and its j-th NN CMatrix ind; ///< ind(i, j) is the index of the j-th NN of i public: // Described in the base class NNBasedIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo, const size_t _M) : ClusterValidityIndex(_X, _K, _allow_undo), M((_M<=n-1)?_M:(n-1)), dist(n, M, INFTY), ind(n, M, n) { GENIECLUST_ASSERT(M>0 && M 0 && dij < dist(i, l-1)) { dist(i, l) = dist(i, l-1); ind(i, l) = ind(i, l-1); l--; } dist(i, l) = dij; ind(i, l) = j; } if (dij < dist(j, M-1)) { // i may be amongst M NNs of j size_t l = M-1; while (l > 0 && dij < dist(j, l-1)) { dist(j, l) = dist(j, l-1); ind(j, l) = ind(j, l-1); l--; } dist(j, l) = dij; ind(j, l) = i; } } } } }; #endif genieclust/src/c_matrix.h0000644000176200001440000000676514634041214015177 0ustar liggesusers/* Lightweight matrix class - KISS * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. 
* This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to <https://www.gnu.org/licenses/>. */ #ifndef __c_matrix_h #define __c_matrix_h #include <vector> /** * Represents a matrix as a C-contiguous array, * i.e., in a row-major order. */ template <class T> class CMatrix { private: size_t n, d; std::vector<T> elems; public: /** Initialises a new matrix of size _nrow*_ncol, filled with 0s * * @param _nrow * @param _ncol */ CMatrix(size_t _nrow, size_t _ncol) : n(_nrow), d(_ncol), elems(_nrow*_ncol) { ; } /** Initialises a new matrix of size _nrow*_ncol, filled with _ts * * @param _nrow * @param _ncol * @param _t */ CMatrix(size_t _nrow, size_t _ncol, T _t) : n(_nrow), d(_ncol), elems(_nrow*_ncol, _t) { ; } /** Initialises a new matrix of size _nrow*_ncol based on a contiguous * C- or Fortran-style array * * @param _data * @param _nrow * @param _ncol * @param _c_order whether the first _ncol elements in _data constitute the first row * or the first _nrow elements define the first column */ template <class S> CMatrix(const S* _data, size_t _nrow, size_t _ncol, bool _c_order) : n(_nrow), d(_ncol), elems(_nrow*_ncol) { if (_c_order) { for (size_t i=0; i<_nrow*_ncol; ++i) elems[i] = (T)(_data[i]); } else { size_t k = 0; for (size_t i=0; i<_nrow; i++) { for (size_t j=0; j<_ncol; j++) { elems[k++] = (T)_data[i+_nrow*j]; } } } } /** Read/write access to an element in the i-th row and the j-th column * * @param i * @param j * @return a reference to the indicated matrix element */ T& operator()(const size_t i, const size_t j) { return elems[d*i + j]; } const T& operator()(const size_t i, const size_t j) const { return elems[d*i + j]; } /** Returns a direct pointer to the underlying C-contiguous data array: * the first ncol elements
give the 1st row, * the next ncol elements give the 2nd row, * and so forth. * * @return pointer */ T* data() { return elems.data(); } const T* data() const { return elems.data(); } /** Returns a direct pointer to the start of the i-th row * * @param i * @return pointer */ T* row(const size_t i) { return elems.data()+i*d; } const T* row(const size_t i) const { return elems.data()+i*d; } /** Returns the number of rows * * @return */ size_t nrow() const { return n; } /** Returns the number of columns * * @return */ size_t ncol() const { return d; } }; #endif genieclust/src/RcppExports.cpp0000644000176200001440000004311014650457135016211 0ustar liggesusers// Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include <Rcpp.h> using namespace Rcpp; #ifdef RCPP_USE_GLOBAL_ROSTREAM Rcpp::Rostream<true>& Rcpp::Rcout = Rcpp::Rcpp_cout_get(); Rcpp::Rostream<false>& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get(); #endif // normalized_clustering_accuracy double normalized_clustering_accuracy(RObject x, RObject y); RcppExport SEXP _genieclust_normalized_clustering_accuracy(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(normalized_clustering_accuracy(x, y)); return rcpp_result_gen; END_RCPP } // normalized_pivoted_accuracy double normalized_pivoted_accuracy(RObject x, RObject y); RcppExport SEXP _genieclust_normalized_pivoted_accuracy(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(normalized_pivoted_accuracy(x, y)); return rcpp_result_gen; END_RCPP } // pair_sets_index double pair_sets_index(RObject x, RObject y, bool simplified, bool clipped);
RcppExport SEXP _genieclust_pair_sets_index(SEXP xSEXP, SEXP ySEXP, SEXP simplifiedSEXP, SEXP clippedSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); Rcpp::traits::input_parameter< bool >::type simplified(simplifiedSEXP); Rcpp::traits::input_parameter< bool >::type clipped(clippedSEXP); rcpp_result_gen = Rcpp::wrap(pair_sets_index(x, y, simplified, clipped)); return rcpp_result_gen; END_RCPP } // adjusted_rand_score double adjusted_rand_score(RObject x, RObject y, bool clipped); RcppExport SEXP _genieclust_adjusted_rand_score(SEXP xSEXP, SEXP ySEXP, SEXP clippedSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); Rcpp::traits::input_parameter< bool >::type clipped(clippedSEXP); rcpp_result_gen = Rcpp::wrap(adjusted_rand_score(x, y, clipped)); return rcpp_result_gen; END_RCPP } // rand_score double rand_score(RObject x, RObject y); RcppExport SEXP _genieclust_rand_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(rand_score(x, y)); return rcpp_result_gen; END_RCPP } // adjusted_fm_score double adjusted_fm_score(RObject x, RObject y, bool clipped); RcppExport SEXP _genieclust_adjusted_fm_score(SEXP xSEXP, SEXP ySEXP, SEXP clippedSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); Rcpp::traits::input_parameter< bool >::type clipped(clippedSEXP); rcpp_result_gen = Rcpp::wrap(adjusted_fm_score(x, y, clipped)); return rcpp_result_gen; 
END_RCPP } // fm_score double fm_score(RObject x, RObject y); RcppExport SEXP _genieclust_fm_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(fm_score(x, y)); return rcpp_result_gen; END_RCPP } // mi_score double mi_score(RObject x, RObject y); RcppExport SEXP _genieclust_mi_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(mi_score(x, y)); return rcpp_result_gen; END_RCPP } // normalized_mi_score double normalized_mi_score(RObject x, RObject y); RcppExport SEXP _genieclust_normalized_mi_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(normalized_mi_score(x, y)); return rcpp_result_gen; END_RCPP } // adjusted_mi_score double adjusted_mi_score(RObject x, RObject y, bool clipped); RcppExport SEXP _genieclust_adjusted_mi_score(SEXP xSEXP, SEXP ySEXP, SEXP clippedSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); Rcpp::traits::input_parameter< bool >::type clipped(clippedSEXP); rcpp_result_gen = Rcpp::wrap(adjusted_mi_score(x, y, clipped)); return rcpp_result_gen; END_RCPP } // normalized_confusion_matrix NumericMatrix normalized_confusion_matrix(RObject x, RObject y); RcppExport SEXP _genieclust_normalized_confusion_matrix(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; 
Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(normalized_confusion_matrix(x, y)); return rcpp_result_gen; END_RCPP } // normalizing_permutation IntegerVector normalizing_permutation(RObject x, RObject y); RcppExport SEXP _genieclust_normalizing_permutation(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(normalizing_permutation(x, y)); return rcpp_result_gen; END_RCPP } // calinski_harabasz_index double calinski_harabasz_index(NumericMatrix X, NumericVector y); RcppExport SEXP _genieclust_calinski_harabasz_index(SEXP XSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(calinski_harabasz_index(X, y)); return rcpp_result_gen; END_RCPP } // dunnowa_index double dunnowa_index(NumericMatrix X, NumericVector y, int M, Rcpp::String owa_numerator, Rcpp::String owa_denominator); RcppExport SEXP _genieclust_dunnowa_index(SEXP XSEXP, SEXP ySEXP, SEXP MSEXP, SEXP owa_numeratorSEXP, SEXP owa_denominatorSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); Rcpp::traits::input_parameter< int >::type M(MSEXP); Rcpp::traits::input_parameter< Rcpp::String >::type owa_numerator(owa_numeratorSEXP); Rcpp::traits::input_parameter< Rcpp::String >::type owa_denominator(owa_denominatorSEXP); rcpp_result_gen = Rcpp::wrap(dunnowa_index(X, y, M, owa_numerator, owa_denominator)); return rcpp_result_gen; END_RCPP } // generalised_dunn_index double 
generalised_dunn_index(NumericMatrix X, NumericVector y, int lowercase_d, int uppercase_d); RcppExport SEXP _genieclust_generalised_dunn_index(SEXP XSEXP, SEXP ySEXP, SEXP lowercase_dSEXP, SEXP uppercase_dSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); Rcpp::traits::input_parameter< int >::type lowercase_d(lowercase_dSEXP); Rcpp::traits::input_parameter< int >::type uppercase_d(uppercase_dSEXP); rcpp_result_gen = Rcpp::wrap(generalised_dunn_index(X, y, lowercase_d, uppercase_d)); return rcpp_result_gen; END_RCPP } // negated_ball_hall_index double negated_ball_hall_index(NumericMatrix X, NumericVector y); RcppExport SEXP _genieclust_negated_ball_hall_index(SEXP XSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(negated_ball_hall_index(X, y)); return rcpp_result_gen; END_RCPP } // negated_davies_bouldin_index double negated_davies_bouldin_index(NumericMatrix X, NumericVector y); RcppExport SEXP _genieclust_negated_davies_bouldin_index(SEXP XSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(negated_davies_bouldin_index(X, y)); return rcpp_result_gen; END_RCPP } // negated_wcss_index double negated_wcss_index(NumericMatrix X, NumericVector y); RcppExport SEXP _genieclust_negated_wcss_index(SEXP XSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); 
rcpp_result_gen = Rcpp::wrap(negated_wcss_index(X, y)); return rcpp_result_gen; END_RCPP } // silhouette_index double silhouette_index(NumericMatrix X, NumericVector y); RcppExport SEXP _genieclust_silhouette_index(SEXP XSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(silhouette_index(X, y)); return rcpp_result_gen; END_RCPP } // silhouette_w_index double silhouette_w_index(NumericMatrix X, NumericVector y); RcppExport SEXP _genieclust_silhouette_w_index(SEXP XSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(silhouette_w_index(X, y)); return rcpp_result_gen; END_RCPP } // wcnn_index double wcnn_index(NumericMatrix X, NumericVector y, int M); RcppExport SEXP _genieclust_wcnn_index(SEXP XSEXP, SEXP ySEXP, SEXP MSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< NumericVector >::type y(ySEXP); Rcpp::traits::input_parameter< int >::type M(MSEXP); rcpp_result_gen = Rcpp::wrap(wcnn_index(X, y, M)); return rcpp_result_gen; END_RCPP } // dot_mst_default NumericMatrix dot_mst_default(NumericMatrix X, String distance, int M, bool cast_float32, bool verbose); RcppExport SEXP _genieclust_dot_mst_default(SEXP XSEXP, SEXP distanceSEXP, SEXP MSEXP, SEXP cast_float32SEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< String >::type distance(distanceSEXP); Rcpp::traits::input_parameter< int >::type M(MSEXP); 
Rcpp::traits::input_parameter< bool >::type cast_float32(cast_float32SEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_mst_default(X, distance, M, cast_float32, verbose)); return rcpp_result_gen; END_RCPP } // dot_mst_dist NumericMatrix dot_mst_dist(NumericVector d, int M, bool verbose); RcppExport SEXP _genieclust_dot_mst_dist(SEXP dSEXP, SEXP MSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericVector >::type d(dSEXP); Rcpp::traits::input_parameter< int >::type M(MSEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_mst_dist(d, M, verbose)); return rcpp_result_gen; END_RCPP } // dot_genie IntegerVector dot_genie(NumericMatrix mst, int k, double gini_threshold, String postprocess, bool detect_noise, bool verbose); RcppExport SEXP _genieclust_dot_genie(SEXP mstSEXP, SEXP kSEXP, SEXP gini_thresholdSEXP, SEXP postprocessSEXP, SEXP detect_noiseSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type mst(mstSEXP); Rcpp::traits::input_parameter< int >::type k(kSEXP); Rcpp::traits::input_parameter< double >::type gini_threshold(gini_thresholdSEXP); Rcpp::traits::input_parameter< String >::type postprocess(postprocessSEXP); Rcpp::traits::input_parameter< bool >::type detect_noise(detect_noiseSEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_genie(mst, k, gini_threshold, postprocess, detect_noise, verbose)); return rcpp_result_gen; END_RCPP } // dot_gclust List dot_gclust(NumericMatrix mst, double gini_threshold, bool verbose); RcppExport SEXP _genieclust_dot_gclust(SEXP mstSEXP, SEXP gini_thresholdSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< 
NumericMatrix >::type mst(mstSEXP); Rcpp::traits::input_parameter< double >::type gini_threshold(gini_thresholdSEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_gclust(mst, gini_threshold, verbose)); return rcpp_result_gen; END_RCPP } // gini_index double gini_index(Rcpp::NumericVector x); RcppExport SEXP _genieclust_gini_index(SEXP xSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::NumericVector >::type x(xSEXP); rcpp_result_gen = Rcpp::wrap(gini_index(x)); return rcpp_result_gen; END_RCPP } // bonferroni_index double bonferroni_index(Rcpp::NumericVector x); RcppExport SEXP _genieclust_bonferroni_index(SEXP xSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::NumericVector >::type x(xSEXP); rcpp_result_gen = Rcpp::wrap(bonferroni_index(x)); return rcpp_result_gen; END_RCPP } // devergottini_index double devergottini_index(Rcpp::NumericVector x); RcppExport SEXP _genieclust_devergottini_index(SEXP xSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::NumericVector >::type x(xSEXP); rcpp_result_gen = Rcpp::wrap(devergottini_index(x)); return rcpp_result_gen; END_RCPP } static const R_CallMethodDef CallEntries[] = { {"_genieclust_normalized_clustering_accuracy", (DL_FUNC) &_genieclust_normalized_clustering_accuracy, 2}, {"_genieclust_normalized_pivoted_accuracy", (DL_FUNC) &_genieclust_normalized_pivoted_accuracy, 2}, {"_genieclust_pair_sets_index", (DL_FUNC) &_genieclust_pair_sets_index, 4}, {"_genieclust_adjusted_rand_score", (DL_FUNC) &_genieclust_adjusted_rand_score, 3}, {"_genieclust_rand_score", (DL_FUNC) &_genieclust_rand_score, 2}, {"_genieclust_adjusted_fm_score", (DL_FUNC) &_genieclust_adjusted_fm_score, 3}, {"_genieclust_fm_score", (DL_FUNC) &_genieclust_fm_score, 2}, {"_genieclust_mi_score", 
(DL_FUNC) &_genieclust_mi_score, 2}, {"_genieclust_normalized_mi_score", (DL_FUNC) &_genieclust_normalized_mi_score, 2}, {"_genieclust_adjusted_mi_score", (DL_FUNC) &_genieclust_adjusted_mi_score, 3}, {"_genieclust_normalized_confusion_matrix", (DL_FUNC) &_genieclust_normalized_confusion_matrix, 2}, {"_genieclust_normalizing_permutation", (DL_FUNC) &_genieclust_normalizing_permutation, 2}, {"_genieclust_calinski_harabasz_index", (DL_FUNC) &_genieclust_calinski_harabasz_index, 2}, {"_genieclust_dunnowa_index", (DL_FUNC) &_genieclust_dunnowa_index, 5}, {"_genieclust_generalised_dunn_index", (DL_FUNC) &_genieclust_generalised_dunn_index, 4}, {"_genieclust_negated_ball_hall_index", (DL_FUNC) &_genieclust_negated_ball_hall_index, 2}, {"_genieclust_negated_davies_bouldin_index", (DL_FUNC) &_genieclust_negated_davies_bouldin_index, 2}, {"_genieclust_negated_wcss_index", (DL_FUNC) &_genieclust_negated_wcss_index, 2}, {"_genieclust_silhouette_index", (DL_FUNC) &_genieclust_silhouette_index, 2}, {"_genieclust_silhouette_w_index", (DL_FUNC) &_genieclust_silhouette_w_index, 2}, {"_genieclust_wcnn_index", (DL_FUNC) &_genieclust_wcnn_index, 3}, {"_genieclust_dot_mst_default", (DL_FUNC) &_genieclust_dot_mst_default, 5}, {"_genieclust_dot_mst_dist", (DL_FUNC) &_genieclust_dot_mst_dist, 3}, {"_genieclust_dot_genie", (DL_FUNC) &_genieclust_dot_genie, 6}, {"_genieclust_dot_gclust", (DL_FUNC) &_genieclust_dot_gclust, 3}, {"_genieclust_gini_index", (DL_FUNC) &_genieclust_gini_index, 1}, {"_genieclust_bonferroni_index", (DL_FUNC) &_genieclust_bonferroni_index, 1}, {"_genieclust_devergottini_index", (DL_FUNC) &_genieclust_devergottini_index, 1}, {NULL, NULL, 0} }; RcppExport void R_init_genieclust(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } genieclust/src/c_common.h0000644000176200001440000000441014634041214015144 0ustar liggesusers/* Common functions, macros, includes * * Copyleft (C) 2018-2024, Marek Gagolewski * * This 
program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License
 * Version 3, 19 November 2007, published by the Free Software Foundation.
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License Version 3 for more details.
 * You should have received a copy of the License along with this program.
 * If this is not the case, refer to <https://www.gnu.org/licenses/>.
 */

#ifndef __c_common_h
#define __c_common_h

#ifdef Py_PYTHON_H
#define GENIECLUST_PYTHON 1
#endif

#ifdef GENIECLUST_R
#undef GENIECLUST_R
#define GENIECLUST_R 1
#endif

#include <stdexcept>
#include <limits>
#include <cmath>

#ifdef _OPENMP
#include <omp.h>
#define OPENMP_ENABLED 1
#else
#define OPENMP_ENABLED 0
#endif

#ifndef GENIECLUST_ASSERT
#define __GENIECLUST_STR(x) #x
#define GENIECLUST_STR(x) __GENIECLUST_STR(x)
#define GENIECLUST_ASSERT(EXPR) { if (!(EXPR)) \
    throw std::runtime_error( "genieclust: Assertion " #EXPR " failed in " \
        __FILE__ ":" GENIECLUST_STR(__LINE__) ); }
#endif

#if GENIECLUST_R
#include <Rcpp.h>
#else
#include <cstdio>
#endif

#if GENIECLUST_R
#define GENIECLUST_PRINT(fmt) REprintf((fmt));
#else
#define GENIECLUST_PRINT(fmt) fprintf(stderr, (fmt));
#endif

#if GENIECLUST_R
#define GENIECLUST_PRINT_int(fmt, val) REprintf((fmt), (int)(val));
#else
#define GENIECLUST_PRINT_int(fmt, val) fprintf(stderr, (fmt), (int)(val));
#endif

#if GENIECLUST_R
#define GENIECLUST_PRINT_float(fmt, val) REprintf((fmt), (double)(val));
#else
#define GENIECLUST_PRINT_float(fmt, val) fprintf(stderr, (fmt), (double)(val));
#endif

#if GENIECLUST_R
typedef ssize_t Py_ssize_t;
#endif

typedef double FLOAT_T;  ///< float type we are working internally with

#ifndef INFTY
#define INFTY (std::numeric_limits<double>::infinity())
#endif

#define IS_PLUS_INFTY(x)  ((x) > 0.0 && !std::isfinite(x))
#define IS_MINUS_INFTY(x) ((x) < 0.0 && !std::isfinite(x))

#define CVI_MAX_N_PRECOMPUTE_DISTANCE 10000

#endif
genieclust/src/c_mst.h
/* Minimum Spanning Tree Algorithms:
 * a. Prim-Jarnik's for Complete Undirected Graphs,
 * b. Kruskal's for k-NN graphs.
 *
 * Copyleft (C) 2018-2024, Marek Gagolewski
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License
 * Version 3, 19 November 2007, published by the Free Software Foundation.
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License Version 3 for more details.
 * You should have received a copy of the License along with this program.
 * If this is not the case, refer to <https://www.gnu.org/licenses/>.
 */

#ifndef __c_mst_h
#define __c_mst_h

#include "c_common.h"
#include <vector>
#include <algorithm>
#include <stdexcept>
#include "c_argfuns.h"
#include "c_disjoint_sets.h"
#include "c_distance.h"

#ifdef _OPENMP
void Comp_set_num_threads(Py_ssize_t n_threads) {
    if (n_threads <= 0)
        n_threads = omp_get_max_threads();
    omp_set_num_threads(n_threads);
}
#else
void Comp_set_num_threads(Py_ssize_t /*n_threads*/) { ; }
#endif

/*! Represents an undirected edge in a weighted graph.
 *  Main purpose: a comparer used to sort MST edges w.r.t. increasing weights
 */
template <class T>
class CMstTriple {
public:
    Py_ssize_t i1;  //!< first vertex defining an edge
    Py_ssize_t i2;  //!< second vertex defining an edge
    T d;            //!< edge weight

    CMstTriple() {}

    CMstTriple(Py_ssize_t i1, Py_ssize_t i2, T d, bool order=true) {
        this->d = d;
        if (!order || (i1 < i2)) { this->i1 = i1; this->i2 = i2; }
        else                     { this->i1 = i2; this->i2 = i1; }
    }

    bool operator<(const CMstTriple& other) const {
        if (d == other.d) {
            if (i1 == other.i1) return i2 < other.i2;
            else                return i1 < other.i1;
        }
        else
            return d < other.d;
    }
};

/*!
Computes a minimum spanning forest of a near-neighbour * graph using Kruskal's algorithm, and orders * its edges w.r.t. increasing weights. * * It is assumed that each point can have a different number of nearest * neighbours determined and hence the input graph is given in a "list"-like * form. * * Note that, in general, an MST of the (<=k)-nearest neighbour graph * might not be equal to the MST of the complete Pairwise Distances Graph. * * * @param nns [in/out] a c_contiguous array, shape (c,), of CMstTriple elements * defining the near-neighbour graphs. Loops are ignored. * The array is sorted in-place. * @param c number of elements in `nns`. * @param n number of nodes in the graph * @param mst_dist [out] c_contiguous vector of length n-1, gives weights of the * resulting MST edges in nondecreasing order; * refer to the function's return value for the actual number * of edges generated (if this is < n-1, the object is padded with INFTY) * @param mst_ind [out] c_contiguous matrix of size (n-1)*2, defining the edges * corresponding to mst_d, with mst_i[j,0] <= mst_i[j,1] for all j; * refer to the function's return value for the actual number * of edges generated (if this is < n-1, the object is padded with -1) * @param verbose output diagnostic/progress messages? * * @return number of edges in the minimal spanning forest */ // template // Py_ssize_t Cmst_from_nn_list(CMstTriple* nns, Py_ssize_t c, // Py_ssize_t n, T* mst_dist, Py_ssize_t* mst_ind, bool verbose=false) // { // if (n <= 0) throw std::domain_error("n <= 0"); // if (c <= 0) throw std::domain_error("c <= 0"); // // if (verbose) // GENIECLUST_PRINT_int("[genieclust] Computing the MST... %3d%%", 0); // // std::sort(nns, nns+c); // unstable sort (do we need stable here?) 
// // Py_ssize_t triple_cur = 0; // Py_ssize_t mst_edge_cur = 0; // // CDisjointSets ds(n); // while (mst_edge_cur < n-1) { // if (triple_cur == c) { // // The input graph is not connected (we have a forest) // Py_ssize_t ret = mst_edge_cur; // while (mst_edge_cur < n-1) { // mst_ind[2*mst_edge_cur+0] = -1; // mst_ind[2*mst_edge_cur+1] = -1; // mst_dist[mst_edge_cur] = INFTY; // mst_edge_cur++; // } // if (verbose) // GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1)); // return ret; // } // // Py_ssize_t u = nns[triple_cur].i1; // Py_ssize_t v = nns[triple_cur].i2; // T d = nns[triple_cur].d; // triple_cur++; // // if (u > v) std::swap(u, v); // assure u < v // if (u < 0 || ds.find(u) == ds.find(v)) // continue; // // mst_ind[2*mst_edge_cur+0] = u; // mst_ind[2*mst_edge_cur+1] = v; // mst_dist[mst_edge_cur] = d; // // GENIECLUST_ASSERT(mst_edge_cur == 0 || mst_dist[mst_edge_cur] >= mst_dist[mst_edge_cur-1]); // // ds.merge(u, v); // mst_edge_cur++; // // // if (verbose) // GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1)); // // #if GENIECLUST_R // Rcpp::checkUserInterrupt(); // #elif GENIECLUST_PYTHON // if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); // #endif // } // // if (verbose) GENIECLUST_PRINT("\b\b\b\bdone.\n"); // // return mst_edge_cur; // } /*! Computes a minimum spanning forest of a (<=k)-nearest neighbour * (i.e., one that consists of 1-, 2-, ..., k-neighbours = the first k * nearest neighbours) graph using Kruskal's algorithm, and orders * its edges w.r.t. increasing weights. * * Note that, in general, an MST of the (<=k)-nearest neighbour graph * might not be equal to the MST of the complete Pairwise Distances Graph. * * It is assumed that each query point is not its own neighbour. 
 *
 * @param dist a c_contiguous array, shape (n,k),
 *     dist[i,j] gives the weight of the (undirected) edge {i, ind[i,j]}
 * @param ind a c_contiguous array, shape (n,k),
 *     (undirected) edge definition, interpreted as {i, ind[i,j]};
 *     negative indices as well as those such that ind[i,j]==i are ignored
 * @param d_core "core" distance (or NULL);
 *     if not NULL, then the distance between 2 points will be
 *     d(i, ind[i,j]) = max(d(i, ind[i,j]), d_core[i], d_core[ind[i,j]])
 * @param n number of nodes
 * @param k minimal degree of all the nodes
 * @param mst_dist [out] c_contiguous vector of length n-1, gives weights of the
 *     resulting MST edges in nondecreasing order;
 *     refer to the function's return value for the actual number
 *     of edges generated (if this is < n-1, the object is padded with INFTY)
 * @param mst_ind [out] c_contiguous matrix of size (n-1)*2, defining the edges
 *     corresponding to mst_dist, with mst_ind[j,0] <= mst_ind[j,1] for all j;
 *     refer to the function's return value for the actual number
 *     of edges generated (if this is < n-1, the object is padded with -1)
 * @param maybe_inexact [out] true indicates that k should be increased to
 *     guarantee that the resulting tree would be the same if a complete
 *     pairwise distance graph was given.
 * @param verbose output diagnostic/progress messages?
 *
 * @return number of edges in the minimal spanning forest
 */
template <class T>
Py_ssize_t Cmst_from_nn(
    const T* dist, const Py_ssize_t* ind, const T* d_core,
    Py_ssize_t n, Py_ssize_t k,
    T* mst_dist, Py_ssize_t* mst_ind,
    bool* maybe_inexact, bool verbose=false)
{
    if (n <= 0) throw std::domain_error("n <= 0");
    if (k <= 0) throw std::domain_error("k <= 0");
    if (k >= n) throw std::domain_error("k >= n");
    Py_ssize_t nk = n*k;

    if (verbose)
        GENIECLUST_PRINT_int("[genieclust] Computing the MST... %3d%%", 0);

    std::vector< CMstTriple<T> > nns(nk);
    Py_ssize_t c = 0;
    for (Py_ssize_t i = 0; i < n; ++i) {
        for (Py_ssize_t j = 0; j < k; ++j) {
            Py_ssize_t i2 = ind[k*i+j];
            if (i2 >= 0 && i2 != i) {
                double d = dist[k*i+j];
                if (d_core) {
                    // d(i, i2) = max(d(i,i2), d_core[i], d_core[i2])
                    if (d < d_core[i])  d = d_core[i];
                    if (d < d_core[i2]) d = d_core[i2];
                }
                nns[c++] = CMstTriple<T>(i, i2, d, true);
            }
        }
    }

    std::stable_sort(nns.data(), nns.data()+c);

    Py_ssize_t triple_cur = 0;
    Py_ssize_t mst_edge_cur = 0;

    CDisjointSets ds(n);
    while (mst_edge_cur < n-1) {
        if (triple_cur == c) {
            // The input graph is not connected (we have a forest)
            Py_ssize_t ret = mst_edge_cur;
            while (mst_edge_cur < n-1) {
                mst_ind[2*mst_edge_cur+0] = -1;
                mst_ind[2*mst_edge_cur+1] = -1;
                mst_dist[mst_edge_cur]    = INFTY;
                mst_edge_cur++;
            }
            if (verbose)
                GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1));
            return ret;
        }

        Py_ssize_t u = nns[triple_cur].i1;
        Py_ssize_t v = nns[triple_cur].i2;
        T d = nns[triple_cur].d;
        triple_cur++;

        if (ds.find(u) == ds.find(v))
            continue;

        mst_ind[2*mst_edge_cur+0] = u;
        mst_ind[2*mst_edge_cur+1] = v;
        mst_dist[mst_edge_cur]    = d;

        GENIECLUST_ASSERT(mst_edge_cur == 0 || mst_dist[mst_edge_cur] >= mst_dist[mst_edge_cur-1]);

        ds.merge(u, v);
        mst_edge_cur++;

        if (verbose)
            GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1));

        #if GENIECLUST_R
        Rcpp::checkUserInterrupt();
        #elif GENIECLUST_PYTHON
        if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught");
        #endif
    }

    if (verbose) GENIECLUST_PRINT("\b\b\b\bdone.\n");

    return mst_edge_cur;
}

/*! Determine the first k nearest neighbours of each point.
 *
 *  Exactly n*(n-1)/2 distance computations are performed.
 *
 *  It is assumed that each query point is not its own neighbour.
* * Worst-case time complexity: O(n*(n-1)/2*d*k) * * * @param D a callable CDistance object such that a call to * D(j, M, Py_ssize_t l) returns an n-ary array * with the distances from the j-th point to l points whose indices * are given in array M * @param n number of points * @param k number of nearest neighbours, * @param dist [out] a c_contiguous array, shape (n,k), * dist[i,j] gives the weight of the (undirected) edge {i, ind[i,j]} * @param ind [out] a c_contiguous array, shape (n,k), * (undirected) edge definition, interpreted as {i, ind[i,j]} * @param verbose output diagnostic/progress messages? */ template void Cknn_from_complete(CDistance* D, Py_ssize_t n, Py_ssize_t k, T* dist, Py_ssize_t* ind, bool verbose=false) { if (n <= 0) throw std::domain_error("n <= 0"); if (k <= 0) throw std::domain_error("k <= 0"); if (k >= n) throw std::domain_error("k >= n"); if (verbose) GENIECLUST_PRINT_int("[genieclust] Computing the K-nn graph... %3d%%", 0); for (Py_ssize_t i=0; i M(n); for (Py_ssize_t i=0; i 0 && dij[j] < dist[i*k+l-1]) { dist[i*k+l] = dist[i*k+l-1]; ind[i*k+l] = ind[i*k+l-1]; l -= 1; } dist[i*k+l] = dij[j]; ind[i*k+l] = j; } if (dij[j] < dist[j*k+k-1]) { // i might be amongst k-NNs of j Py_ssize_t l = k-1; while (l > 0 && dij[j] < dist[j*k+l-1]) { dist[j*k+l] = dist[j*k+l-1]; ind[j*k+l] = ind[j*k+l-1]; l -= 1; } dist[j*k+l] = dij[j]; ind[j*k+l] = i; } } if (verbose) GENIECLUST_PRINT_int("\b\b\b\b%3d%%", (n-1+n-i-1)*(i+1)*100/n/(n-1)); #if GENIECLUST_R Rcpp::checkUserInterrupt(); #elif GENIECLUST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif } if (verbose) GENIECLUST_PRINT("\b\b\b\bdone.\n"); } /*! A Jarnik (Prim/Dijkstra)-like algorithm for determining * a(*) minimum spanning tree (MST) of a complete undirected graph * with weights given by, e.g., a symmetric n*n matrix. * * However, the distances can be computed on the fly, so that O(n) memory is used. 
* * (*) Note that there might be multiple minimum trees spanning a given graph. * * * References: * ---------- * * M. Gagolewski, M. Bartoszuk, A. Cena, * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, * Information Sciences 363 (2016) 8-23. * * V. Jarnik, O jistem problemu minimalnim, * Prace Moravske Prirodovedecke Spolecnosti 6 (1930) 57-63. * * C.F. Olson, Parallel algorithms for hierarchical clustering, * Parallel Comput. 21 (1995) 1313-1325. * * R. Prim, Shortest connection networks and some generalisations, * Bell Syst. Tech. J. 36 (1957) 1389-1401. * * * @param D a callable CDistance object such that a call to * D(j, M, Py_ssize_t k) returns an n-ary array * with the distances from the j-th point to k points whose indices * are given in array M * @param n number of points * @param mst_d [out] vector of length n-1, gives weights of the * resulting MST edges in nondecreasing order * @param mst_i [out] vector of length 2*(n-1), representing * a c_contiguous array of shape (n-1,2), defining the edges * corresponding to mst_d, with mst_i[j,0] < mst_i[j,1] for all j * @param verbose output diagnostic/progress messages? */ template void Cmst_from_complete(CDistance* D, Py_ssize_t n, T* mst_dist, Py_ssize_t* mst_ind, bool verbose=false) { std::vector Dnn(n, INFTY); std::vector Fnn(n); std::vector M(n); std::vector< CMstTriple > res(n-1); for (Py_ssize_t i=0; i 0); GENIECLUST_ASSERT(Fnn[bestj] != bestj); lastj = bestj; // start from bestj next time // M[bestjpos] = M[n-i-1]; // don't visit bestj again // (#62) new version: keep M sorted (more CPU cache-friendly): for (Py_ssize_t j=bestjpos; j(Fnn[bestj], bestj, Dnn[bestj], true); if (verbose) GENIECLUST_PRINT_int("\b\b\b\b%3d%%", (n-1+n-i-1)*(i+1)*100/n/(n-1)); #if GENIECLUST_R Rcpp::checkUserInterrupt(); #elif GENIECLUST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif } // sort the resulting MST edges in increasing order w.r.t. 
d std::sort(res.begin(), res.end()); for (Py_ssize_t i=0; i, * see https://doi.org/10.1016/j.ins.2021.10.004. * * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_WCNN_H #define __CVI_WCNN_H #include "cvi.h" /** Within-Cluster Nearest-Neighbours * * For given M, returns the overall proportion of * each point's M nearest neighbours belonging to the same cluster. * * If there is a cluster of size <= M, the index is -INFTY. * * See the following paper for the formula and further discussion: * M. Gagolewski, M. Bartoszuk, A. Cena, * Are cluster validity measures (in)valid?, Information Sciences 581, * 620-636, 2021, DOI:10.1016/j.ins.2021.10.004 */ class WCNNIndex : public NNBasedIndex { public: // Described in the base class WCNNIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo=false, const size_t _M=10 ) : NNBasedIndex(_X, _K, _allow_undo, _M) { ; } virtual FLOAT_T compute() { for (size_t i=0; i, * see https://doi.org/10.1016/j.ins.2021.10.004. * * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. 
* This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_cvi_h #define __c_cvi_h #include #include #include #include #include "c_common.h" #include "c_matrix.h" #include "cvi.h" #include "cvi_calinski_harabasz.h" #include "cvi_davies_bouldin.h" #include "cvi_silhouette.h" // #include "cvi_dunn.h" // #include "cvi_gamma.h" #include "cvi_wcss.h" #include "cvi_wcnn.h" #include "cvi_dunnowa.h" #include "cvi_generalized_dunn.h" #include "cvi_generalized_dunn_lowercase_d1.h" #include "cvi_generalized_dunn_lowercase_d2.h" #include "cvi_generalized_dunn_lowercase_d3.h" #include "cvi_generalized_dunn_lowercase_d4.h" #include "cvi_generalized_dunn_lowercase_d5.h" #include "cvi_generalized_dunn_lowercase_d6.h" #include "cvi_generalized_dunn_uppercase_d1.h" #include "cvi_generalized_dunn_uppercase_d2.h" #include "cvi_generalized_dunn_uppercase_d3.h" double c_calinski_harabasz_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K) { CalinskiHarabaszIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_negated_ball_hall_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K) { WCSSIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, false, true/*weighted*/ ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_negated_davies_bouldin_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K) { DaviesBouldinIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_negated_wcss_index(const 
double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K) { WCSSIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, false, false/*not weighted*/ ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_silhouette_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K) { SilhouetteIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, false, false ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_silhouette_w_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K) { SilhouetteIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, false, true ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_wcnn_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K, size_t M) { if (M <= 0) throw std::invalid_argument("M must be positive."); WCNNIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, false, M ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_dunnowa_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K, size_t M, const char* owa_numerator, const char* owa_denominator) { int _owa_numerator = DuNNOWA_get_OWA(std::string(owa_numerator)); int _owa_denominator = DuNNOWA_get_OWA(std::string(owa_denominator)); if (_owa_numerator == OWA_ERROR || _owa_denominator == OWA_ERROR) throw std::invalid_argument("Invalid OWA operator specifier."); if (M <= 0) throw std::invalid_argument("M must be positive."); DuNNOWAIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, false, M, _owa_numerator, _owa_denominator ); ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } double c_generalised_dunn_index(const double* X, const Py_ssize_t* y, size_t n, size_t d, Py_ssize_t K, size_t lowercase_d, size_t uppercase_d) { LowercaseDeltaFactory* lowercase_deltaFactory; UppercaseDeltaFactory* uppercase_deltaFactory; if 
(lowercase_d == 1) { lowercase_deltaFactory = new LowercaseDelta1Factory(); } else if (lowercase_d == 2) { lowercase_deltaFactory = new LowercaseDelta2Factory(); } else if (lowercase_d == 3) { lowercase_deltaFactory = new LowercaseDelta3Factory(); } else if (lowercase_d == 4) { lowercase_deltaFactory = new LowercaseDelta4Factory(); } else if (lowercase_d == 5) { lowercase_deltaFactory = new LowercaseDelta5Factory(); } else if (lowercase_d == 6) { lowercase_deltaFactory = new LowercaseDelta6Factory(); } else { throw std::invalid_argument("Invalid lowercase_d."); } if (uppercase_d == 1) { uppercase_deltaFactory = new UppercaseDelta1Factory(); } else if (uppercase_d == 2) { uppercase_deltaFactory = new UppercaseDelta2Factory(); } else if (uppercase_d == 3) { uppercase_deltaFactory = new UppercaseDelta3Factory(); } else { throw std::invalid_argument("Invalid uppercase_d."); } bool areCentroidsNeeded = ( lowercase_deltaFactory->IsCentroidNeeded() || uppercase_deltaFactory->IsCentroidNeeded() ); if (areCentroidsNeeded) { GeneralizedDunnIndexCentroidBased ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, lowercase_deltaFactory, uppercase_deltaFactory); delete lowercase_deltaFactory; delete uppercase_deltaFactory; ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } else { GeneralizedDunnIndex ind( CMatrix(X, n, d, /*_c_order=*/true), (Py_ssize_t)K, lowercase_deltaFactory, uppercase_deltaFactory); delete lowercase_deltaFactory; delete uppercase_deltaFactory; ind.set_labels(std::vector(y, y+n)); return (double)ind.compute(); } } #endif genieclust/src/cvi_generalized_dunn_uppercase_d3.h0000644000176200001440000001107214634041051022166 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. 
* Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_GENERALIZED_DUNN_UPPERCASE_D3_H #define __CVI_GENERALIZED_DUNN_UPPERCASE_D3_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class UppercaseDelta3 : public UppercaseDelta { protected: std::vector dist_sums; ///< sum of points distances to centroid: std::vector last_dist_sums; ///< for undo() bool last_chg; ///< for undo() (was dist changed at all?) Py_ssize_t cluster1; Py_ssize_t cluster2; public: UppercaseDelta3( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : UppercaseDelta(D,X,L,count,K,n,d,centroids), dist_sums(K), last_dist_sums(K), last_chg(false) { } virtual void before_modify(size_t i, Py_ssize_t j) { last_chg = true; for (size_t u=0; u #include /*! Disjoint Sets (Union-Find) Data Structure * * A class to represent partitions of the set {0,1,...,n-1} for any n. * * Path compression for find() is implemented, * but the union() operation is naive (neither * it is union by rank nor by size), * see https://en.wikipedia.org/wiki/Disjoint-set_data_structure. * This is by design, as some other operations in the current * package rely on the assumption that the parent id of each * element is always <= than itself. 
 */
class CDisjointSets
{
protected:
    Py_ssize_t n;                //!< number of distinct elements
    Py_ssize_t k;                //!< number of subsets
    std::vector<Py_ssize_t> par; /*!< par[i] is the id of the parent
                                  *   of the i-th element */

public:
    /*! Starts with a "weak" partition {  {0}, {1}, ..., {n-1}  },
     *  i.e., n singletons.
     *
     *  @param n number of elements, n>=0.
     */
    CDisjointSets(Py_ssize_t n) : par(n) {
        // if (n < 0) throw std::domain_error("n < 0");
        this->n = n;
        this->k = n;
        for (Py_ssize_t i=0; i<n; ++i)
            this->par[i] = i;
    }

    /*! A nullary constructor allows Cython to allocate
     *  the instances on the stack. Do not use otherwise.
     */
    CDisjointSets() : CDisjointSets(0) { }

    /*! Returns the current number of sets in the partition. */
    Py_ssize_t get_k() const { return this->k; }

    /*! Returns the total cardinality of the set being partitioned. */
    Py_ssize_t get_n() const { return this->n; }

    /*! Finds the subset id for a given x.
     *
     *  @param x a value in {0,...,n-1}
     */
    Py_ssize_t find(Py_ssize_t x) {
        if (x < 0 || x >= this->n) throw std::domain_error("x not in [0,n)");
        if (this->par[x] != x) {
            this->par[x] = this->find(this->par[x]);
        }
        return this->par[x];
    }

    /*! Merges the sets containing x and y.
     *
     *  Let px be the parent id of x, and py be the parent id of y.
     *  If px < py, then the new parent id of py will be set to px.
     *  Otherwise, px will have py as its parent.
     *
     *  If x and y are already members of the same subset,
     *  an exception is thrown.
     *
     *  @return the id of the parent of x or y, whichever is smaller.
* * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} */ virtual Py_ssize_t merge(Py_ssize_t x, Py_ssize_t y) { // well, union is a reserved C++ keyword :) x = this->find(x); // includes a range check for x y = this->find(y); // includes a range check for y if (x == y) throw std::invalid_argument("find(x) == find(y)"); if (y < x) std::swap(x, y); this->par[y] = x; this->k -= 1; return x; } }; #endif genieclust/src/cvi_generalized_dunn_uppercase_d1.h0000644000176200001440000000760014634041051022166 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_GENERALIZED_DUNN_UPPERCASE_D1_H #define __CVI_GENERALIZED_DUNN_UPPERCASE_D1_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class UppercaseDelta1 : public UppercaseDelta { protected: std::vector diam; /**< cluster diameters: diam[i] = max( X(u,), X(v,) ), X(u,), X(v,) in C_i */ std::vector last_diam; ///< for undo() bool last_chg; ///< for undo() (was diam changed at all?) 
bool needs_recompute; ///< for before and after modify public: UppercaseDelta1( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : UppercaseDelta(D,X,L,count,K,n,d,centroids), diam(K), last_diam(K) { } virtual void before_modify(size_t i, Py_ssize_t j) { needs_recompute = false; for (size_t u=0; u diam[L[i]].d) { diam[L[i]] = DistTriple(i, u, d); last_chg = true; } } } } } virtual void undo(){ if (last_chg) { for (size_t i=0; i diam[L[i]].d) diam[L[i]] = DistTriple(i, j, d); } } } } virtual FLOAT_T compute(size_t k){ return sqrt(diam[k].d); } }; class UppercaseDelta1Factory : public UppercaseDeltaFactory { public: virtual bool IsCentroidNeeded() { return false; } virtual UppercaseDelta* create(EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new UppercaseDelta1(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/c_genie.h0000644000176200001440000010235614634041214014753 0ustar liggesusers/* The Genie++ Clustering Algorithm * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_genie_h #define __c_genie_h #include "c_common.h" #include #include #include #include #include "c_gini_disjoint_sets.h" #include "c_int_dict.h" #include "c_preprocess.h" /*! 
Base class for CGenie and CGIc */ template class CGenieBase { protected: /*! Stores the clustering result as obtained by * CGenie::apply_genie() or CGIc::apply_gic() */ struct CGenieResult { CGiniDisjointSets ds; /*!< ds at the last iteration, it; * use denoise_index to obtain the final partition */ std::vector links; // deg; // denoise_index; // denoise_index_rev; //!< reverse look-up for denoise_index CCountDisjointSets forest_components; CGenieResult results; /*! When the Genie correction is on, some MST edges will be chosen * in a non-consecutive order. An array-based skiplist will speed up * searching within the to-be-consumed edges. Also, if there are * noise points, then the skiplist allows the algorithm * to naturally ignore edges that connect the leaves. */ void mst_skiplist_init(CIntDict* mst_skiplist) { // start with a list that skips all edges that lead to noise points mst_skiplist->clear(); for (Py_ssize_t i=0; in-1; ++i) { Py_ssize_t i1 = this->mst_i[i*2+0]; Py_ssize_t i2 = this->mst_i[i*2+1]; GENIECLUST_ASSERT(i1 < this->n) GENIECLUST_ASSERT(i2 < this->n) if (i1 < 0 || i2 < 0) { continue; // a no-edge -> ignore } if (!this->noise_leaves || (this->deg[i1]>1 && this->deg[i2]>1)) { (*mst_skiplist)[i] = i; /*only the key is important, not the value*/ } } } /** internal, used by get_labels(n_clusters, res) */ Py_ssize_t get_labels(CGiniDisjointSets* ds, Py_ssize_t* res) { std::vector res_cluster_id(n, -1); Py_ssize_t c = 0; for (Py_ssize_t i=0; idenoise_index_rev[i] >= 0) { // a non-noise point Py_ssize_t j = this->denoise_index[ ds->find(this->denoise_index_rev[i]) ]; if (res_cluster_id[j] < 0) { // new cluster res_cluster_id[j] = c; ++c; } res[i] = res_cluster_id[j]; } else { // a noise point res[i] = -1; } } return c; } public: CGenieBase(T* mst_d, Py_ssize_t* mst_i, Py_ssize_t n, bool noise_leaves) : deg(n), denoise_index(n), denoise_index_rev(n) { this->mst_d = mst_d; this->mst_i = mst_i; this->n = n; this->noise_leaves = noise_leaves; // Py_ssize_t 
missing_mst_edges = 0; for (Py_ssize_t i=0; ideg: Cget_graph_node_degrees(mst_i, n-1, n, this->deg.data()); // Create the non-noise points' translation table (for GiniDisjointSets) // and count the number of noise points if (noise_leaves) { noise_count = 0; Py_ssize_t j = 0; for (Py_ssize_t i=0; i= 2); GENIECLUST_ASSERT(j + noise_count == n); } else { // there are no noise points this->noise_count = 0; for (Py_ssize_t i=0; in - this->noise_count); for (Py_ssize_t i=0; in-1; ++i) { Py_ssize_t i1 = this->mst_i[i*2+0]; Py_ssize_t i2 = this->mst_i[i*2+1]; GENIECLUST_ASSERT(i1 < this->n) GENIECLUST_ASSERT(i2 < this->n) if (i1 < 0 || i2 < 0) { continue; // a no-edge -> ignore } if (!this->noise_leaves || (this->deg[i1]>1 && this->deg[i2]>1)) { forest_components.merge(this->denoise_index_rev[i1], this->denoise_index_rev[i2]); } } } /*! There can be at most n-noise_count singleton clusters * in the hierarchy. */ Py_ssize_t get_max_n_clusters() const { return this->n - this->noise_count; } /*! Propagate res with clustering results. * * Noise points get cluster id of -1. 
* * @param n_clusters maximal number of clusters to find * @param res [out] c_contiguous array of length n * * @return number of clusters detected (not including the noise cluster; * can be less than n_clusters) */ Py_ssize_t get_labels(Py_ssize_t n_clusters, Py_ssize_t* res) { if (this->results.ds.get_n() <= 0) throw std::runtime_error("Apply the clustering procedure first."); if (n_clusters <= this->results.n_clusters) { // use this->results.ds -- from the final iteration return this->get_labels(&(this->results.ds), res); } else { CGiniDisjointSets ds(this->get_max_n_clusters()); for (Py_ssize_t it=0; itget_max_n_clusters() - n_clusters; ++it) { Py_ssize_t j = (this->results.links[it]); if (j < 0) break; // remaining are no-edges Py_ssize_t i1 = this->mst_i[2*j+0]; Py_ssize_t i2 = this->mst_i[2*j+1]; GENIECLUST_ASSERT(i1 >= 0) GENIECLUST_ASSERT(i2 >= 0) ds.merge(this->denoise_index_rev[i1], this->denoise_index_rev[i2]); } return this->get_labels(&ds, res); } } /*! Propagate res with clustering results - * all partitions from cardinality n_clusters to 1. * * Noise points get cluster id of -1. * * @param n_clusters maximal number of clusters to find * @param res [out] c_contiguous matrix of shape (n_clusters+1, n) */ void get_labels_matrix(Py_ssize_t n_clusters, Py_ssize_t* res) { if (this->get_max_n_clusters() < n_clusters) { // there is nothing to do, no merge needed. 
throw std::runtime_error("The requested number of clusters \ is too large with this many detected noise points"); } if (this->results.ds.get_n() <= 0) throw std::runtime_error("Apply the clustering procedure first."); if (n_clusters < this->forest_components.get_k()) { n_clusters = this->forest_components.get_k(); } CGiniDisjointSets ds(this->get_max_n_clusters()); // you can do up to this->get_max_n_clusters() - 1 merges Py_ssize_t cur_cluster = n_clusters; if (this->get_max_n_clusters() == n_clusters) { cur_cluster--; GENIECLUST_ASSERT(cur_cluster >= 0) this->get_labels(&ds, &res[cur_cluster * this->n]); } for (Py_ssize_t it=0; itget_max_n_clusters() - 1; ++it) { Py_ssize_t j = (this->results.links[it]); if (j >= 0) { // might not be true if forest_components.get_k() > 1 Py_ssize_t i1 = this->mst_i[2*j+0]; Py_ssize_t i2 = this->mst_i[2*j+1]; GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0) ds.merge(this->denoise_index_rev[i1], this->denoise_index_rev[i2]); } if (it >= this->get_max_n_clusters() - n_clusters - 1) { cur_cluster--; GENIECLUST_ASSERT(cur_cluster >= 0) this->get_labels(&ds, &res[cur_cluster * this->n]); } } GENIECLUST_ASSERT(cur_cluster == 0) } /*! Propagate res with clustering results - * based on the current this->results.links. * * If there are noise points, not all elements in res will be set. * * @param res [out] c_contiguous array of length n-1, * res[i] gives the index of the MST edge merged at the i-th iteration. * * @return number of items in res set (the array is padded with -1s) */ Py_ssize_t get_links(Py_ssize_t* res) { if (this->results.ds.get_n() <= 0) throw std::runtime_error("Apply the clustering procedure first."); for (Py_ssize_t i=0; iresults.it; ++i) { res[i] = this->results.links[i]; } for (Py_ssize_t i=this->results.it; in-1; ++i) { res[i] = -1; } return this->results.it; } /*! Set res[i] to true if the i-th point is a noise one. 
* * Makes sense only if noise_leaves==true * * @param res [out] array of length n */ void get_noise_status(bool* res) const { for (Py_ssize_t i=0; inoise_leaves && this->deg[i] <= 1); } } }; /*! The Genie++ Hierarchical Clustering Algorithm * * The Genie algorithm (Gagolewski et al., 2016) links two clusters * in such a way that a chosen economic inequality measure * (here, the Gini index) of the cluster sizes does not increase drastically * above a given threshold. The method most often outperforms * the Ward or average linkage, k-means, spectral clustering, * DBSCAN, Birch and others in terms of the clustering * quality on benchmark data while retaining the speed of the single * linkage algorithm. * * This is a re-implementation of the original (Gagolewski et al., 2016) * algorithm. New features include: * 1. Given a pre-computed minimum spanning tree (MST), * it only requires amortised O(n sqrt(n))-time. * 2. MST leaves can be * marked as noise points (if `noise_leaves==True`). This is useful, * if the Genie algorithm is applied on the MST with respect to * the HDBSCAN-like mutual reachability distance. * 3. (option-experimental) During merge, first pair of clusters that would * give a decrease of the Gini index below the threshold is chosen * (or the one that gives the smallest Gini index if that's not possible) * -- turns out to be slower and worse on benchmark data. * 4. The MST need not be connected (is a spanning forest) (e.g., if it * computed based on a disconnected k-NN graph) - each component * will never be merged with any other one. * * * * References * =========== * * Gagolewski M., Bartoszuk M., Cena A., * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, * Information Sciences 363, 2016, pp. 8-23. 
doi:10.1016/j.ins.2016.05.003 */ template class CGenie : public CGenieBase { protected: bool experimental_forced_merge; //* mst_skiplist, Py_ssize_t n_clusters, double gini_threshold, std::vector* links) { if (n_clusters > this->get_max_n_clusters()) { // there is nothing to do, no merge needed. throw std::runtime_error("The requested number of clusters \ is too large with this many detected noise points"); } if (n_clusters < this->forest_components.get_k()) { n_clusters = this->forest_components.get_k(); // throw std::runtime_error("The requested number of clusters // is too small as the MST is not connected"); } // mst_skiplist contains all mst_i edge indexes // that we need to consider, and nothing more. GENIECLUST_ASSERT(!mst_skiplist->empty()); Py_ssize_t lastidx = mst_skiplist->get_key_min(); Py_ssize_t lastm = 0; // last minimal cluster size Py_ssize_t it = 0; while (!mst_skiplist->empty() && ds->get_k() > n_clusters) { // determine the pair of vertices to merge Py_ssize_t i1; Py_ssize_t i2; if (ds->get_gini() > gini_threshold) { // the Genie correction for inequality of cluster sizes Py_ssize_t m = ds->get_smallest_count(); if (m != lastm || lastidx < mst_skiplist->get_key_min()) { // need to start from the beginning of the MST skiplist lastidx = mst_skiplist->get_key_min(); } // else reuse lastidx GENIECLUST_ASSERT(lastidx < this->n - 1) GENIECLUST_ASSERT(lastidx >= 0 && lastidx < this->n - 1); GENIECLUST_ASSERT(this->mst_i[2*lastidx+0] >= 0 && this->mst_i[2*lastidx+1] >= 0); // find the MST edge connecting a cluster of the smallest size // with another one while (ds->get_count(this->denoise_index_rev[this->mst_i[2*lastidx+0]]) != m && ds->get_count(this->denoise_index_rev[this->mst_i[2*lastidx+1]]) != m) { lastidx = mst_skiplist->get_key_next(lastidx); GENIECLUST_ASSERT(lastidx >= 0 && lastidx < this->n - 1); GENIECLUST_ASSERT(this->mst_i[2*lastidx+0] >= 0 && this->mst_i[2*lastidx+1] >= 0); } i1 = this->mst_i[2*lastidx+0]; i2 = this->mst_i[2*lastidx+1]; 
(*links)[it] = lastidx; Py_ssize_t delme = lastidx; lastidx = mst_skiplist->get_key_next(lastidx); mst_skiplist->erase(delme); // O(1) lastm = m; } else { // single linkage-like // note that we consume the MST edges in an non-decreasing order w.r.t. weights Py_ssize_t curidx = mst_skiplist->pop_key_min(); GENIECLUST_ASSERT(curidx >= 0 && curidx < this->n - 1); i1 = this->mst_i[2*curidx+0]; i2 = this->mst_i[2*curidx+1]; (*links)[it] = curidx; } GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0) Py_ssize_t i1r = this->denoise_index_rev[i1]; Py_ssize_t i2r = this->denoise_index_rev[i2]; bool forget = this->forest_components.get_k() > 1 && this->forest_components.find(i1r) == this->forest_components.find(i2r) && this->forest_components.get_count(i1r) == ds->get_count(i1r) + ds->get_count(i2r); if (forget) { ds->merge_and_forget(i1r, i2r); } else { ds->merge(i1r, i2r); } it++; } return it; // number of merges performed } /*! Merge a pair of sets that reduces the Gini index below the threshold * (provided that is possible) * * **EXPERIMENTAL** This is slower and not that awesome. * * @param ds * @param mst_skiplist * @param n_clusters maximal number of clusters to detect * @param gini_threshold * @param links [out] c_contiguous array of size (n-1), * links[iter] = index of merged mst_i (up to the number of performed * merges, see retval). * * @return The number of performed merges. */ Py_ssize_t do_genie_experimental_forced_merge(CGiniDisjointSets* ds, CIntDict* mst_skiplist, Py_ssize_t n_clusters, double gini_threshold, std::vector* links) { if (n_clusters > this->get_max_n_clusters()) { // there is nothing to do, no merge needed. 
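The main loop of `do_genie()` above can be summarised with a toy Python sketch: consume MST edges single-linkage style (cheapest first), except that whenever the Gini index of the cluster sizes exceeds the threshold, take the cheapest remaining edge incident to a cluster of minimal size. Unlike the C++ code, which maintains the sizes and the Gini index incrementally via `CGiniDisjointSets` and a skiplist, the sketch recomputes everything naively on each iteration; it assumes a connected MST whose edge list is pre-sorted by weight, and all names are illustrative:

```python
# Toy illustration of the Genie merge rule (not the package API).
def gini_of_sizes(sizes):
    """Normalised Gini index of a list of positive cluster sizes."""
    xs = sorted(sizes)
    n = len(xs)
    if n < 2:
        return 0.0
    num = sum((n - 2*i + 1) * xs[n - i] for i in range(1, n + 1))
    return num / ((n - 1) * sum(xs))

def genie(n, mst_edges, n_clusters, gini_threshold):
    """mst_edges: (u, v) pairs of a connected MST, sorted by weight."""
    par = list(range(n))
    size = [1] * n

    def find(x):
        while par[x] != x:
            par[x] = par[par[x]]
            x = par[x]
        return x

    pending = list(mst_edges)
    k = n
    while k > n_clusters:
        sizes = [size[i] for i in range(n) if find(i) == i]
        if gini_of_sizes(sizes) > gini_threshold:
            # Genie correction: cheapest edge touching a smallest cluster
            m = min(sizes)
            j = next(idx for idx, (u, v) in enumerate(pending)
                     if size[find(u)] == m or size[find(v)] == m)
        else:
            j = 0  # plain single linkage: cheapest remaining edge
        u, v = pending.pop(j)
        ru, rv = find(u), find(v)
        par[rv] = ru
        size[ru] += size[rv]
        k -= 1
    return [find(i) for i in range(n)]
```

On a six-point path graph whose heaviest edge bridges the two halves, two requested clusters recover exactly those halves.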
throw std::runtime_error("The requested number of clusters \ is too large with this many detected noise points"); } if (n_clusters < this->forest_components.get_k()) { n_clusters = this->forest_components.get_k(); // throw std::runtime_error("The requested number of clusters // is too small as the MST is not connected"); } // mst_skiplist contains all mst_i edge indexes // that we need to consider, and nothing more. GENIECLUST_ASSERT(!mst_skiplist->empty()); Py_ssize_t it = 0; while (!mst_skiplist->empty() && ds->get_k() > n_clusters) { // determine the pair of vertices to merge Py_ssize_t last_idx = mst_skiplist->get_key_min(); double best_gini = 1.0; Py_ssize_t best_idx = last_idx; while (1) { Py_ssize_t i1 = this->mst_i[2*last_idx+0]; Py_ssize_t i2 = this->mst_i[2*last_idx+1]; Py_ssize_t i1r = this->denoise_index_rev[i1]; Py_ssize_t i2r = this->denoise_index_rev[i2]; bool forget = this->forest_components.get_k() > 1 && this->forest_components.find(i1r) == this->forest_components.find(i2r) && this->forest_components.get_count(i1r) == ds->get_count(i1r) + ds->get_count(i2r); double test_gini = ds->test_gini_after_merge(i1r, i2r, forget); if (test_gini < best_gini) { best_gini = test_gini; best_idx = last_idx; } // printf(" %ld-%ld %.3lf %.3lf\n", i1r, i2r, test_gini, gini_threshold); if (best_gini <= gini_threshold) break; if (last_idx == mst_skiplist->get_key_max()) break; last_idx = mst_skiplist->get_key_next(last_idx); } Py_ssize_t i1 = this->mst_i[2*best_idx+0]; Py_ssize_t i2 = this->mst_i[2*best_idx+1]; Py_ssize_t i1r = this->denoise_index_rev[i1]; Py_ssize_t i2r = this->denoise_index_rev[i2]; bool forget = this->forest_components.get_k() > 1 && this->forest_components.find(i1r) == this->forest_components.find(i2r) && this->forest_components.get_count(i1r) == ds->get_count(i1r) + ds->get_count(i2r); (*links)[it] = best_idx; mst_skiplist->erase(best_idx); // O(1) if (forget) ds->merge_and_forget(i1r, i2r); else ds->merge(i1r, i2r); // printf("%ld-%ld %.3lf\n", 
i1r, i2r, ds->get_gini()); it++; } return it; // number of merges performed } public: CGenie(T* mst_d, Py_ssize_t* mst_i, Py_ssize_t n, bool noise_leaves=false, bool experimental_forced_merge=false) : CGenieBase(mst_d, mst_i, n, noise_leaves), experimental_forced_merge(experimental_forced_merge) { ; } CGenie() : CGenie(NULL, NULL, 0, false) { } /*! Run the Genie++ algorithm * * @param n_clusters number of clusters to find, 1 for the complete hierarchy * (warning: the algorithm might stop early if there are many noise points * or the number of clusters to detect is > 1). * @param gini_threshold the Gini index threshold */ void apply_genie(Py_ssize_t n_clusters, double gini_threshold) { if (n_clusters < 1) throw std::domain_error("n_clusters must be >= 1"); this->results = typename CGenieBase::CGenieResult(this->n, this->noise_count, n_clusters); CIntDict mst_skiplist(this->n - 1); this->mst_skiplist_init(&mst_skiplist); if (experimental_forced_merge) { this->results.it = this->do_genie_experimental_forced_merge(&(this->results.ds), &mst_skiplist, n_clusters, gini_threshold, &(this->results.links)); } else { this->results.it = this->do_genie(&(this->results.ds), &mst_skiplist, n_clusters, gini_threshold, &(this->results.links)); } } }; /*! GIc (Genie+Information Criterion) Hierarchical Clustering Algorithm * * GIc has been originally proposed by Anna Cena in [1] and was inspired * by Mueller's (et al.) ITM [2] and Gagolewski's (et al.) Genie [3]; * see also [4]. * * References * ========== * * [1] Cena A., Adaptive hierarchical clustering algorithms based on * data aggregation methods, PhD Thesis, Systems Research Institute, * Polish Academy of Sciences 2018. * * [2] Mueller A., Nowozin S., Lampert C.H., Information Theoretic * Clustering using Minimum Spanning Trees, DAGM-OAGM 2012. * * [3] Gagolewski M., Bartoszuk M., Cena A., * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, * Information Sciences 363, 2016, pp. 8-23. 
doi:10.1016/j.ins.2016.05.003 * * [4] Gagolewski M., Cena A., Bartoszuk M., Brzozowski L., * Clustering with Minimum Spanning Trees: How Good Can It Be?, * in preparation, 2023. */ template class CGIc : public CGenie { protected: /*! Run the Genie++ algorithm with different thresholds for the Gini index * and determine the intersection of all the resulting * n_clusters-partitions; for this, we need the union of the * set of MST edges that were left "unmerged". * * @param n_clusters number of clusters to look for in Genie run * @param gini_thresholds array of floats in [0,1] * @param n_thresholds size of gini_thresholds * * @return indexes of MST edges that were left unused by at least * one Genie algorithm run; this gives the intersection of partitions. * The resulting list will contain a sentinel, this->n - 1. * * If n_thresholds is 0 or the requested n_clusters is too large, * all non-noise edges are set as unused. */ std::vector get_intersection_of_genies(Py_ssize_t n_clusters, double* gini_thresholds, Py_ssize_t n_thresholds) { std::vector unused_edges; if (n_thresholds == 0 || n_clusters >= this->get_max_n_clusters()) { // all edges unused -> will start from n singletons for (Py_ssize_t i=0; i < this->n - 1; ++i) { Py_ssize_t i1 = this->mst_i[2*i+0]; Py_ssize_t i2 = this->mst_i[2*i+1]; if (i1 < 0 || i2 < 0) continue; // a no-edge -> ignore if (!this->noise_leaves || (this->deg[i1] > 1 && this->deg[i2] > 1)) unused_edges.push_back(i); } unused_edges.push_back(this->n - 1); // sentinel return unused_edges; // EOF. 
} else { // the same initial skiplist is used in each iter: CIntDict mst_skiplist_template(this->n-1); this->mst_skiplist_init(&mst_skiplist_template); for (Py_ssize_t i=0; iget_max_n_clusters()); std::vector links(this->n - 1, -1); // the history of edge merges CIntDict mst_skiplist(mst_skiplist_template); this->do_genie(&ds, &mst_skiplist, n_clusters, gini_threshold, &links); // start where do_genie() concluded; add all remaining MST edges // to the list of unused_edges while (!mst_skiplist.empty()) unused_edges.push_back(mst_skiplist.pop_key_min()); } // let unused_edges = sort(unique(unused_edges)) unused_edges.push_back(this->n - 1); // sentinel std::sort(unused_edges.begin(), unused_edges.end()); // sorted, but some might not be unique, so let's remove dups Py_ssize_t k = 0; for (Py_ssize_t i=1; i<(Py_ssize_t)unused_edges.size(); ++i) { if (unused_edges[i] != unused_edges[k]) { k++; unused_edges[k] = unused_edges[i]; } } unused_edges.resize(k+1); GENIECLUST_ASSERT(unused_edges[k] == this->n - 1); return unused_edges; } } public: CGIc(T* mst_d, Py_ssize_t* mst_i, Py_ssize_t n, bool noise_leaves=false) : CGenie(mst_d, mst_i, n, noise_leaves) { if (this->forest_components.get_k() > 1) throw std::domain_error("MST is not connected; this is not (yet) supported"); } CGIc() : CGIc(NULL, NULL, 0, false) { } /*! 
Run the GIc (Genie+Information Criterion) algorithm * * @param n_clusters maximal number of clusters to find, * 1 for the complete hierarchy (if possible) * @param add_clusters number of additional clusters to work * with internally * @param n_features number of features (can be fractional) * @param gini_thresholds array of size n_thresholds * @param n_thresholds size of gini_thresholds */ void apply_gic(Py_ssize_t n_clusters, Py_ssize_t add_clusters, double n_features, double* gini_thresholds, Py_ssize_t n_thresholds) { if (n_clusters < 1) throw std::domain_error("n_clusters must be >= 1"); GENIECLUST_ASSERT(add_clusters>=0); GENIECLUST_ASSERT(n_thresholds>=0); std::vector unused_edges = get_intersection_of_genies( n_clusters+add_clusters, gini_thresholds, n_thresholds ); // note that the unused_edges list: // 1. does not include noise edges; // 2. is sorted (strictly) increasingly // 3. contains a sentinel element at the end == n-1 this->results = typename CGenieBase::CGenieResult(this->n, this->noise_count, n_clusters); // Step 1. Merge all used edges (used by all the Genies) // There are of course many possible merge orders that we could consider // here. We will rely on the current ordering of the MST edges, // which is wrt non-decreasing mst_d. 
Py_ssize_t cur_unused_edges = 0; Py_ssize_t num_unused_edges = unused_edges.size()-1; // ignore sentinel std::vector cluster_sizes(this->get_max_n_clusters(), 1); std::vector cluster_d_sums(this->get_max_n_clusters(), (T)0.0); this->results.it = 0; for (Py_ssize_t i=0; in - 1; ++i) { GENIECLUST_ASSERT(i<=unused_edges[cur_unused_edges]); if (unused_edges[cur_unused_edges] == i) { // ignore current edge and advance to the next unused edge cur_unused_edges++; continue; } Py_ssize_t i1 = this->mst_i[2*i+0]; Py_ssize_t i2 = this->mst_i[2*i+1]; if (i1 < 0 || i2 < 0) continue; // a no-edge -> ignore if (!this->noise_leaves || (this->deg[i1] > 1 && this->deg[i2] > 1)) { GENIECLUST_ASSERT(this->results.it < this->n-1); this->results.links[this->results.it++] = i; i1 = this->results.ds.find(this->denoise_index_rev[i1]); i2 = this->results.ds.find(this->denoise_index_rev[i2]); if (i1 > i2) std::swap(i1, i2); this->results.ds.merge(i1, i2); // new parent node is i1 cluster_sizes[i1] += cluster_sizes[i2]; cluster_d_sums[i1] += cluster_d_sums[i2] + this->mst_d[i]; cluster_sizes[i2] = 0; cluster_d_sums[i2] = INFTY; } } GENIECLUST_ASSERT(cur_unused_edges == num_unused_edges); // sentinel GENIECLUST_ASSERT(unused_edges[num_unused_edges] == this->n-1); // sentinel GENIECLUST_ASSERT(num_unused_edges+1 == this->results.ds.get_k()); // Step 2. 
Merge all used edges /* The objective function - Information Criterion - to MAXIMISE is sum_{i in ds.parents()} -cluster_sizes[i] * ( n_features * log cluster_sizes[i] -(n_features-1) * log cluster_d_sums[i] ) */ while (num_unused_edges > 0 && this->results.itget_max_n_clusters() - n_clusters) { Py_ssize_t max_which = -1; double max_obj = -INFTY; for (Py_ssize_t j=0; jmst_i[2*i+0]; Py_ssize_t i2 = this->mst_i[2*i+1]; GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0); i1 = this->results.ds.find(this->denoise_index_rev[i1]); i2 = this->results.ds.find(this->denoise_index_rev[i2]); if (i1 > i2) std::swap(i1, i2); GENIECLUST_ASSERT(i1 != i2); // singletons should be merged first // (we assume that they have cluster_d_sums==Inf // (this was not addressed in Mueller's paper) if (cluster_d_sums[i1] < 1e-12 || cluster_d_sums[i2] < 1e-12) { max_which = j; break; } double cur_obj = -(cluster_sizes[i1]+cluster_sizes[i2])*( n_features*std::log((double)cluster_d_sums[i1]+cluster_d_sums[i2]+this->mst_d[i]) -(n_features-1.0)*std::log((double)cluster_sizes[i1]+cluster_sizes[i2]) ); cur_obj += cluster_sizes[i1]*( n_features*std::log((double)cluster_d_sums[i1]) -(n_features-1.0)*std::log((double)cluster_sizes[i1]) ); cur_obj += cluster_sizes[i2]*( n_features*std::log((double)cluster_d_sums[i2]) -(n_features-1.0)*std::log((double)cluster_sizes[i2]) ); GENIECLUST_ASSERT(std::isfinite(cur_obj)); if (cur_obj > max_obj) { max_obj = cur_obj; max_which = j; } } GENIECLUST_ASSERT(max_which >= 0 && max_which < num_unused_edges); Py_ssize_t i = unused_edges[max_which]; GENIECLUST_ASSERT(this->results.it < this->n - 1); this->results.links[this->results.it++] = i; Py_ssize_t i1 = this->mst_i[2*i+0]; Py_ssize_t i2 = this->mst_i[2*i+1]; GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0); i1 = this->results.ds.find(this->denoise_index_rev[i1]); i2 = this->results.ds.find(this->denoise_index_rev[i2]); if (i1 > i2) std::swap(i1, i2); this->results.ds.merge(i1, i2); // new parent node is i1 cluster_sizes[i1] += 
cluster_sizes[i2]; cluster_d_sums[i1] += cluster_d_sums[i2]+this->mst_d[i]; cluster_sizes[i2] = 0; cluster_d_sums[i2] = INFTY; unused_edges[max_which] = unused_edges[num_unused_edges-1]; num_unused_edges--; } } }; #endif genieclust/src/cvi_generalized_dunn_lowercase_d4.h0000644000176200001440000000513014634041051022162 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __CVI_GENERALIZED_DUNN_LOWERCASE_D4_H #define __CVI_GENERALIZED_DUNN_LOWERCASE_D4_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class LowercaseDelta4 : public LowercaseDelta { public: LowercaseDelta4( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : LowercaseDelta(D, X, L, count,K,n,d,centroids) { } virtual void before_modify(size_t i, Py_ssize_t j) { // all happens in CentroidsBasedIndex } virtual void after_modify(size_t i, Py_ssize_t j) { // all happens in CentroidsBasedIndex } virtual void undo() { // all happens in CentroidsBasedIndex } virtual void recompute_all() { // all happens in CentroidsBasedIndex } virtual FLOAT_T compute(size_t k, size_t l) { FLOAT_T act = 0.0; for (size_t u=0; u& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new LowercaseDelta4(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/cvi_generalized_dunn_lowercase_d2.h0000644000176200001440000000522414634041051022164 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __CVI_GENERALIZED_DUNN_LOWERCASE_D2_H #define __CVI_GENERALIZED_DUNN_LOWERCASE_D2_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class LowercaseDelta2 : public LowercaseDelta1 { public: LowercaseDelta2( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : LowercaseDelta1(D, X, L, count, K, n, d, centroids) { comparator = std::greater(); } virtual void recompute_all() { for (size_t i=0; i& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new LowercaseDelta2(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/cvi_generalized_dunn_lowercase_d5.h0000644000176200001440000001071614634041051022171 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_GENERALIZED_DUNN_LOWERCASE_D5_H #define __CVI_GENERALIZED_DUNN_LOWERCASE_D5_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class LowercaseDelta5 : public LowercaseDelta { protected: std::vector dist_sums; ///< sum of points distances to centroid: std::vector last_dist_sums; ///< for undo() bool last_chg; ///< for undo() (was dist changed at all?) 
Py_ssize_t cluster1; Py_ssize_t cluster2; public: LowercaseDelta5( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : LowercaseDelta(D,X,L,count,K,n,d,centroids), dist_sums(K), last_dist_sums(K), last_chg(false) { } virtual void before_modify(size_t i, Py_ssize_t j) { last_chg = true; for (size_t u=0; u& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new LowercaseDelta5(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/c_inequality.h0000644000176200001440000001207214634041214016043 0ustar liggesusers/* Inequality Measures * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_inequality_h #define __c_inequality_h #include "c_common.h" #include /*! The Normalised Gini Index * * The normalised Gini index is given by: * $$ * G(x_1,\dots,x_n) = \frac{ * \sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j| * }{ * (n-1) \sum_{i=1}^n x_i * }. * $$ * * Time complexity: $O(n)$ for sorted data; it holds: * $$ * G(x_1,\dots,x_n) = \frac{ * \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)} * }{ * (n-1) \sum_{i=1}^n x_i * }, * $$ * where $\sigma$ is an ordering permutation of $(x_1,\dots,x_n)$. * * * References * ---------- * * Gini C., Variabilita e Mutabilita, Tipografia di Paolo Cuppini, * Bologna, 1912. 
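The two formulas for the normalised Gini index quoted in the comment above — the O(n^2) pairwise definition and the O(n) expression for non-decreasingly sorted data — can be cross-checked numerically. The Python snippet below is illustrative only (the package implements this in C++ as `Cgini_sorted`):

```python
# Cross-check of the two equivalent normalised Gini index formulas
# (illustrative; not the package API).
def gini_pairwise(x):
    """Direct O(n^2) definition: sum of |x_i - x_j| over i < j."""
    n = len(x)
    num = sum(abs(xi - xj) for i, xi in enumerate(x) for xj in x[i + 1:])
    return num / ((n - 1) * sum(x))

def gini_sorted(x):
    """O(n) formula for x sorted non-decreasingly, x[i] >= 0.
    With 1-based i, x[n - i] visits the largest value first, matching
    the weight (n - 2i + 1) in the formula above."""
    n = len(x)
    s = sum((n - 2*i + 1) * x[n - i] for i in range(1, n + 1))
    return min(1.0, max(0.0, s / ((n - 1) * sum(x))))
```

Both give 1/3 for (1, 2, 3) and 0 for a constant vector, and they agree on any sorted input.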
* * * @param x non-decreasingly sorted c_contiguous input vector >= 0 * @param n length of x * * @return the value of the inequality index, a number in [0,1]. */ template <class T> double Cgini_sorted(const T* x, Py_ssize_t n) { double s = 0.0, t = 0.0; GENIECLUST_ASSERT(x[0] >= 0); GENIECLUST_ASSERT(x[n-1] > 0); for (Py_ssize_t i=1; i<=n; ++i) { t += x[n-i]; s += (n-2.0*i+1.0)*x[n-i]; } s = s/(n-1.0)/t; if (s > 1.0) return 1.0; else if (s < 0.0) return 0.0; else return s; } /*! The Normalised Bonferroni Index * * The normalised Bonferroni index is given by: * $$ * B(x_1,\dots,x_n) = \frac{ * \sum_{i=1}^{n} \left( n-\sum_{j=1}^i \frac{n}{n-j+1} \right) * x_{\sigma(n-i+1)} * }{ * (n-1) \sum_{i=1}^n x_i * }, * $$ * where $\sigma$ is an ordering permutation of $(x_1,\dots,x_n)$. * * Time complexity: $O(n)$ for sorted data. * * * References * ---------- * * Bonferroni C., Elementi di Statistica Generale, Libreria Seber, * Firenze, 1930. * * @param x non-decreasingly sorted c_contiguous input vector >= 0 * @param n length of x * * @return the value of the inequality index, a number in [0,1]. */ template <class T> double Cbonferroni_sorted(const T* x, Py_ssize_t n) { double s = 0.0, t = 0.0, c = 0.0; GENIECLUST_ASSERT(x[0] >= 0); GENIECLUST_ASSERT(x[n-1] > 0); for (Py_ssize_t i=1; i<=n; ++i) { c += n/(n-i+1.0); t += x[n-i]; s += (n-c)*x[n-i]; } s = s/(n-1.0)/t; if (s > 1.0) return 1.0; else if (s < 0.0) return 0.0; else return s; } // #cpdef np.float64_t coefvar(np.ndarray[T] x, bint is_sorted=False): // #""" // #Coefficient of variation // // #$$ // #C(x_1,\dots,x_n) = \sqrt{\frac{ // #\sum_{i=1}^{n-1} \sum_{j=i+1}^n (x_i-x_j)^2 // #}{ // #(n-1) \sum_{i=1}^n x_i^2 // #}}. // #$$ // // #Is this an inequality measures BTW?
// #""" // // ## sorting is not necessary // #cdef unsigned int n = len(x) // #cdef np.float64_t s = 0.0, t = square(x[0]) // #cdef unsigned int i, j // // #for i in range(n-1): // #t += square(x[i+1]) // #for j in range(i+1, n): // #s += square(x[i]-x[j]) // // #return cmath.sqrt(s/(n-1.0)/t) // // // # cpdef np.float64_t vergottini(np.ndarray[T] x, bint is_sorted=False): // # "de Vergottini index // # x <- sort(x, decreasing=TRUE) // # n <- length(x) // # vmax <- sum(1/(2:n)) // # (sum(sapply(1:length(x), function(i) mean(x[1:i])))/sum(x)-1)/vmax // # } /*! The Normalised De Vergottini Index * * The normalised De Vergottini index is given by: * $$ * V(x_1,\dots,x_n) = \frac{1}{\sum_{i=2}^n \frac{1}{i}} \left( * \frac{ \sum_{i=1}^n \left( \sum_{j=i}^{n} \frac{1}{j}\right) * x_{\sigma(n-i+1)} }{\sum_{i=1}^{n} x_i} - 1 * \right) * $$ * where $\sigma$ is an ordering permutation of $(x_1,\dots,x_n)$. * * Time complexity: $O(n)$ for sorted data. * * * * * @param x non-decreasingly sorted c_contiguous input vector >= 0 * @param n length of x * * @return the value of the inequality index, a number in [0,1]. */ template double Cdevergottini_sorted(const T* x, Py_ssize_t n) { double s = 0.0, t = 0.0, c = 0.0, f=0.0, d=0.0; GENIECLUST_ASSERT(x[0] >= 0); GENIECLUST_ASSERT(x[n-1] > 0); for (Py_ssize_t i=2; i<=n; ++i) c += 1.0/(double)i; for (Py_ssize_t i=1; i<=n; ++i) { t += x[i-1]; f += 1.0/(double)(n-i+1); d += f*x[i-1]; // the i-th smallest } s = (d/t-1.0)/c; if (s > 1.0) return 1.0; else if (s < 0.0) return 0.0; else return s; } #endif genieclust/src/c_argfuns.h0000644000176200001440000000712514634041214015327 0ustar liggesusers/* Some sort/search/vector indexing-related functions * missing in the Standard Library, including the ones to: * a. find the (stable) ordering permutation of a vector * b. 
find the k-th smallest value in a vector * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to <https://www.gnu.org/licenses/>. */ #ifndef __c_argsort_h #define __c_argsort_h #include "c_common.h" #include <algorithm> /*! Comparer for argsort(). * * Ensures the resulting permutation is stable. */ template <class T> struct __argsort_comparer { const T* x; __argsort_comparer(const T* x) { this->x = x; } bool operator()(Py_ssize_t i, Py_ssize_t j) const { return this->x[i] < this->x[j] || (this->x[i] == this->x[j] && i < j); } }; /*! Finds an(*) ordering permutation w.r.t. \lt. * * Both ret and x should be of the same length n; * ret will be overwritten. * * (*) or THE stable one, if stable=true, which is the default. * * We call permutation o stable, whenever i<j and x[o[i]]==x[o[j]] imply o[i]<o[j]. */ template <class T> void Cargsort(Py_ssize_t* ret, const T* x, Py_ssize_t n, bool stable=true) { if (n <= 0) throw std::domain_error("n <= 0"); for (Py_ssize_t i=0; i<n; ++i) ret[i] = i; if (stable) std::stable_sort(ret, ret+n, __argsort_comparer<T>(x)); else std::sort(ret, ret+n, __argsort_comparer<T>(x)); } /*! Returns the index of the (k-1)-th smallest value in an array x. * * argkmin(x, 0) == argmin(x), or, more generally, * argkmin(x, k) == np.argsort(x)[k]. * * Run time: O(nk), where n == len(x). Working mem: O(k). * Does not modify x. * * In practice, very fast for small k and randomly ordered * or almost sorted (increasingly) data. * * * If buf is not NULL, it must be of length at least k+1. * * @param x data * @param n length of x * @param k value in {0,...,n-1}, preferably small * @param buf optional working buffer of size >= k+1, will be overwritten */ template <class T> Py_ssize_t Cargkmin(const T* x, Py_ssize_t n, Py_ssize_t k, Py_ssize_t* buf=NULL) { Py_ssize_t* idx; if (n <= 0) throw std::domain_error("n <= 0"); if (k >= n) throw std::domain_error("k >= n"); k += 1; if (!buf) idx = new Py_ssize_t[k]; else idx = buf; for (Py_ssize_t i=0; i<k; ++i) { Py_ssize_t j = i; while (j > 0 && x[i] < x[idx[j-1]]) { idx[j] = idx[j-1]; j -= 1; } idx[j] = i; } for (Py_ssize_t i=k; i<n; ++i) { if (x[idx[k-1]] <= x[i]) continue; Py_ssize_t j = k-1; while (j > 0 && x[i] < x[idx[j-1]]) { idx[j] = idx[j-1]; j -= 1; } idx[j] = i; } Py_ssize_t ret = idx[k-1]; if (!buf) delete [] idx; return ret; } #endif genieclust/src/r_emst.cpp0000644000176200001440000000355314634041214015205 0ustar liggesusers/* Calls RcppMLPACK::DualTreeBoruvka::ComputeMST * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to <https://www.gnu.org/licenses/>.
*/ // version 1.0.0 now uses mlpack::emst // #include // // #include // #include // #include "c_common.h" // // // // // Euclidean MST via MLPACK // // Calls RcppMLPACK::DualTreeBoruvka::ComputeMST // // [[Rcpp::export(".emst_mlpack")]] // Rcpp::NumericMatrix dot_emst_mlpack(Rcpp::NumericMatrix X) // { // Py_ssize_t n = X.nrow(); // Py_ssize_t d = X.ncol(); // // // Let aX = transpose(X) // arma::Mat aX(d, n); // for (Py_ssize_t i=0; i aret; // mlpack::emst::DualTreeBoruvka<>(aX).ComputeMST(aret); // // Rcpp::NumericMatrix ret(n-1, 3); // for (Py_ssize_t i=0; i 1-based) // ret(i, 1) = aret(1, i)+1; // greater edge index // ret(i, 2) = aret(2, i); // distance between the pair of points // GENIECLUST_ASSERT(ret(i, 0) < ret(i, 1)); // GENIECLUST_ASSERT(i == 0 || ret(i-1, 2) <= ret(i, 2)); // } // // return ret; // } genieclust/src/cvi_generalized_dunn_lowercase_d3.h0000644000176200001440000001016614634041051022166 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __CVI_GENERALIZED_DUNN_LOWERCASE_D3_H #define __CVI_GENERALIZED_DUNN_LOWERCASE_D3_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class LowercaseDelta3 : public LowercaseDelta { protected: CMatrix dist_sums; /**< intra-cluster sums: dist(i,j) = min( X(u,), X(v,) ), X(u,) in C_i, X(v,) in C_j (i!=j) */ CMatrix last_dist_sums; ///< for undo() bool last_chg; ///< for undo() (was dist changed at all?) public: LowercaseDelta3( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : LowercaseDelta(D, X, L,count,K,n,d,centroids), dist_sums(K, K), last_dist_sums(K, K), last_chg(false) { } virtual void before_modify(size_t i, Py_ssize_t j) { for (size_t u=0; u& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new LowercaseDelta3(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/cvi_silhouette.h0000644000176200001440000001211214634041051016375 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __CVI_SILHOUETTE_H #define __CVI_SILHOUETTE_H #include "cvi.h" /** The Silhouette Coefficient * * Overall average per-point silhouette scores (widths=false) * or mean of the cluster average silhouette widths (widths=true) as defined * in Sec.2 of (Rousseeuw, 1987). * * * P.J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and * validation of cluster analysis, Computational and Applied Mathematics 20, * 1987, pp. 53-65, doi:10.1016/0377-0427(87)90125-7. * * * See the following paper for the formula and further discussion: * M. Gagolewski, M. Bartoszuk, A. Cena, * Are cluster validity measures (in)valid?, Information Sciences 581, * 620-636, 2021, DOI:10.1016/j.ins.2021.10.004 */ class SilhouetteIndex : public ClusterValidityIndex { protected: std::vector A; ///< cluster "radius" std::vector B; ///< distance to "nearest" cluster CMatrix C; ///< auxiliary array; Let C(i,j) == sum of ///< distances between X(i,:) and all points in the j-th cluster EuclideanDistance D; ///< D(i, j) gives the Euclidean distance ///< between X(i,:) and X(j,:) /can be precomputed for speed/ bool widths; public: // Described in the base class SilhouetteIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo=false, bool _widths=false) : ClusterValidityIndex(_X, _K, _allow_undo), A(n), B(n), C(n, K), D(&X, n<=CVI_MAX_N_PRECOMPUTE_DISTANCE) { widths = _widths; } // Described in the base class virtual void set_labels(const std::vector& _L) { ClusterValidityIndex::set_labels(_L); // sets L, count and centroids for (size_t i=0; i 1) { // silhouette score of 0 for singleton clusters FLOAT_T cur = (B[i]-A[i])/std::max(B[i], A[i]); ret += cur/(FLOAT_T)count[L[i]]; } else num_singletons++; } ret = ret/(FLOAT_T)(K-num_singletons); } else { for (size_t i=0; i 1) { // silhouette score of 0 for singleton clusters FLOAT_T cur = (B[i]-A[i])/std::max(B[i], A[i]); ret += cur; } } ret = ret/(FLOAT_T)n; } GENIECLUST_ASSERT(std::fabs(ret) < 1.0+1e-12); return ret; } }; #endif 
genieclust/src/cvi_dunnowa.h0000644000176200001440000002113714634041051015672 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_DUNNOWA_H #define __CVI_DUNNOWA_H #include "cvi.h" #include "c_argfuns.h" #define OWA_ERROR 0 #define OWA_MEAN 1 #define OWA_CONST 666 #define OWA_MIN 2 #define OWA_MAX 3 #define OWA_SMIN_START 100000 #define OWA_SMIN_LIMIT 199999 #define OWA_SMAX_START 200000 #define OWA_SMAX_LIMIT 299999 int DuNNOWA_get_OWA(std::string owa_name) { if (owa_name == "Mean") return OWA_MEAN; else if (owa_name == "Min") return OWA_MIN; else if (owa_name == "Max") return OWA_MAX; else if (owa_name == "Const") return OWA_CONST; else if (owa_name.substr(0, 5) == "SMin:") { int delta = std::atoi(owa_name.substr(5).c_str()); GENIECLUST_ASSERT(delta > 0 && delta < OWA_SMIN_LIMIT-OWA_SMIN_START); return OWA_SMIN_START+delta; } else if (owa_name.substr(0, 5) == "SMax:") { int delta = std::atoi(owa_name.substr(5).c_str()); GENIECLUST_ASSERT(delta > 0 && delta < OWA_SMAX_LIMIT-OWA_SMAX_START); return OWA_SMAX_START+delta; } else { return OWA_ERROR; }; } #define REC_SQRT_2_PI 0.3989422804014326779399460599343818684758586311649346576659258296 FLOAT_T dnorm(FLOAT_T x, FLOAT_T m, FLOAT_T s) { return REC_SQRT_2_PI*exp(-0.5*square((x-m)/s))/s; } /** 
OWA-based Dunn-like Indices Based on Near Neighbours * * * Proposed by Gagolewski * * See the following paper for the formula and further discussion: * M. Gagolewski, M. Bartoszuk, A. Cena, * Are cluster validity measures (in)valid?, Information Sciences 581, * 620-636, 2021, DOI:10.1016/j.ins.2021.10.004 * * Inspired by generalised Dunn indexes: * * J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting * Compact Well-Separated Clusters, Journal of Cybernetics 3(3), 1974, * pp. 32-57, doi:10.1080/01969727308546046. */ class DuNNOWAIndex : public NNBasedIndex { protected: const int owa_numerator; const int owa_denominator; std::vector order; std::vector pq; ///< for SMin and SMax - aux storage of size 3*delta FLOAT_T aggregate(int owa, bool same_cluster) { if (owa == OWA_MEAN) { FLOAT_T ret = 0.0; size_t count = 0; for (size_t i=0; i=0; --j) { /* yep, a signed type */ // if ((bool)same_cluster == (bool)(L[i] == L[ind(i, j)])) { // if (dist(i, j) > ret) // ret = dist(i, j); // break; // dist(i, :) is sorted increasingly // } // } // } // return ret; for (Py_ssize_t u=n*M-1; u>=0; --u) { /* yep, a signed type */ Py_ssize_t i = order[u]/M; Py_ssize_t j = order[u]%M; if ((bool)same_cluster == (bool)(L[i] == L[ind(i, j)])) { return dist(i, j); } } return -INFTY; } else if (owa == OWA_CONST) { return 1.0; } else if (owa > OWA_SMIN_START && owa <= OWA_SMIN_LIMIT) { Py_ssize_t delta = owa-OWA_SMIN_START; Py_ssize_t pq_cur = 0; for (size_t u=0; u OWA_SMAX_START && owa <= OWA_SMAX_LIMIT) { Py_ssize_t delta = owa-OWA_SMAX_START; Py_ssize_t pq_cur = 0; for (Py_ssize_t u=n*M-1; u>=0; --u) { /* yep, a signed type */ Py_ssize_t i = order[u]/M; Py_ssize_t j = order[u]%M; if ((bool)same_cluster == (bool)(L[i] == L[ind(i, j)])) { pq[pq_cur++] = dist(i, j); if (pq_cur == 3*delta) break; } } if (pq_cur == 0) return INFTY; FLOAT_T sum_wx = 0.0, sum_w = 0.0; for (Py_ssize_t u=0; u& _X, const size_t _K, const bool _allow_undo=false, const size_t _M=10, const int 
_owa_numerator=OWA_MIN, const int _owa_denominator=OWA_MAX ) : NNBasedIndex(_X, _K, _allow_undo, _M), owa_numerator(_owa_numerator), owa_denominator(_owa_denominator), order(n*M) { // Rprintf("%d_%d_%d\n", M, owa_numerator, owa_denominator); int delta = 0; if (owa_numerator > OWA_SMIN_START && owa_numerator <= OWA_SMIN_LIMIT) { delta = std::max(delta, owa_numerator-OWA_SMIN_START); } else if (owa_numerator > OWA_SMAX_START && owa_numerator <= OWA_SMAX_LIMIT) { delta = std::max(delta, owa_numerator-OWA_SMAX_START); } if (owa_denominator > OWA_SMIN_START && owa_denominator <= OWA_SMIN_LIMIT) { delta = std::max(delta, owa_denominator-OWA_SMIN_START); } else if (owa_denominator > OWA_SMAX_START && owa_denominator <= OWA_SMAX_LIMIT) { delta = std::max(delta, owa_denominator-OWA_SMAX_START); } pq = std::vector(3*delta); Cargsort(order.data(), dist.data(), n*M); } virtual FLOAT_T compute() { for (size_t i=0; i * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #include "c_inequality.h" #include #include //' @title Inequality Measures //' //' @description //' \code{gini_index()} gives the normalised Gini index, //' \code{bonferroni_index()} implements the Bonferroni index, and //' \code{devergottini_index()} implements the De Vergottini index. //' //' @details //' These indices can be used to quantify the "inequality" of a numeric sample. //' They can be conceived as normalised measures of data dispersion. 
//' For constant vectors (perfect equity), the indices yield values of 0. //' Vectors with all elements but one equal to 0 (perfect inequality), //' are assigned scores of 1. //' They follow the Pigou-Dalton principle (are Schur-convex): //' setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0} //' and \eqn{x_i - h \geq x_j + h} (taking from the "rich" and //' giving to the "poor") decreases the inequality //' //' These indices have applications in economics, amongst others. //' The Genie clustering algorithm uses the Gini index as a measure //' of the inequality of cluster sizes. //' //' //' The normalised Gini index is given by: //' \deqn{ //' G(x_1,\dots,x_n) = \frac{ //' \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)} //' }{ //' (n-1) \sum_{i=1}^n x_i //' }, //' } //' //' The normalised Bonferroni index is given by: //' \deqn{ //' B(x_1,\dots,x_n) = \frac{ //' \sum_{i=1}^{n} (n-\sum_{j=1}^i \frac{n}{n-j+1}) //' x_{\sigma(n-i+1)} //' }{ //' (n-1) \sum_{i=1}^n x_i //' }. //' } //' //' The normalised De Vergottini index is given by: //' \deqn{ //' V(x_1,\dots,x_n) = //' \frac{1}{\sum_{i=2}^n \frac{1}{i}} \left( //' \frac{ \sum_{i=1}^n \left( \sum_{j=i}^{n} \frac{1}{j}\right) //' x_{\sigma(n-i+1)} }{\sum_{i=1}^{n} x_i} - 1 //' \right). //' } //' //' Here, \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}. //' //' Time complexity: \eqn{O(n)} for sorted (increasingly) data. //' Otherwise, the vector will be sorted. //' //' //' @references //' Bonferroni C., \emph{Elementi di Statistica Generale}, Libreria Seber, //' Firenze, 1930. //' //' Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and //' outlier-resistant hierarchical clustering algorithm, //' \emph{Information Sciences} 363, 2016, pp. 8-23. //' \doi{10.1016/j.ins.2016.05.003} //' //' Gini C., \emph{Variabilita e Mutabilita}, //' Tipografia di Paolo Cuppini, Bologna, 1912. 
//' //' //' @param x numeric vector of non-negative values //' //' @return The value of the inequality index, a number in \eqn{[0, 1]}. //' //' @examples //' gini_index(c(2, 2, 2, 2, 2)) # no inequality //' gini_index(c(0, 0, 10, 0, 0)) # one has it all //' gini_index(c(7, 0, 3, 0, 0)) # give to the poor, take away from the rich //' gini_index(c(6, 0, 3, 1, 0)) # (a.k.a. Pigou-Dalton principle) //' bonferroni_index(c(2, 2, 2, 2, 2)) //' bonferroni_index(c(0, 0, 10, 0, 0)) //' bonferroni_index(c(7, 0, 3, 0, 0)) //' bonferroni_index(c(6, 0, 3, 1, 0)) //' devergottini_index(c(2, 2, 2, 2, 2)) //' devergottini_index(c(0, 0, 10, 0, 0)) //' devergottini_index(c(7, 0, 3, 0, 0)) //' devergottini_index(c(6, 0, 3, 1, 0)) //' //' @name inequality //' @rdname inequality //' @export // [[Rcpp::export]] double gini_index(Rcpp::NumericVector x) { Py_ssize_t n = x.size(); // check if sorted; if not, sort. for (Py_ssize_t i=1; i x[i]) { x = Rcpp::clone(x); std::sort(x.begin(), x.end()); break; } } return Cgini_sorted(REAL(SEXP(x)), n); } //' @rdname inequality //' @export // [[Rcpp::export]] double bonferroni_index(Rcpp::NumericVector x) { Py_ssize_t n = x.size(); // check if sorted; if not, sort. for (Py_ssize_t i=1; i x[i]) { x = Rcpp::clone(x); std::sort(x.begin(), x.end()); break; } } return Cbonferroni_sorted(REAL(SEXP(x)), n); } //' @rdname inequality //' @export // [[Rcpp::export]] double devergottini_index(Rcpp::NumericVector x) { Py_ssize_t n = x.size(); // check if sorted; if not, sort. for (Py_ssize_t i=1; i x[i]) { x = Rcpp::clone(x); std::sort(x.begin(), x.end()); break; } } return Cdevergottini_sorted(REAL(SEXP(x)), n); } genieclust/src/cvi_generalized_dunn_lowercase_d1.h0000644000176200001440000001055414634041051022165 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. 
* Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_GENERALIZED_DUNN_LOWERCASE_D1_H #define __CVI_GENERALIZED_DUNN_LOWERCASE_D1_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class LowercaseDelta1 : public LowercaseDelta { protected: CMatrix dist; /**< intra-cluster distances: dist(i,j) = min( X(u,), X(v,) ), X(u,) in C_i, X(v,) in C_j (i!=j) */ CMatrix last_dist; ///< for undo() bool last_chg; ///< for undo() (was dist changed at all?) 
bool needs_recompute; ///< for before and after modify std::function< bool(FLOAT_T, FLOAT_T) > comparator; public: LowercaseDelta1( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : LowercaseDelta(D, X, L, count,K,n,d,centroids), dist(K, K), last_dist(K, K) { comparator = std::less(); } virtual void before_modify(size_t i, Py_ssize_t j) { needs_recompute = false; for (size_t u=0; u& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new LowercaseDelta1(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/c_scipy_rectangular_lsap.h0000644000176200001440000001776314311561503020430 0ustar liggesusers/* This file is adapted from scipy/scipy/optimize/rectangular_lsap/rectangular_lsap.cpp (version last updated on 5 Mar 2020; c050fd9) See https://github.com/scipy/scipy/ and https://scipy.org/scipylib/. This code implements the shortest augmenting path algorithm for the rectangular assignment problem. This implementation is based on the pseudocode described in pages 1685-1686 of: Crouse D.F., On implementing 2D rectangular assignment algorithms, *IEEE Transactions on Aerospace and Electronic Systems* **52**(4), 2016, pp. 1679-1696, doi:10.1109/TAES.2016.140952. Original author: Peter M. Larsen (https://github.com/pmla/). Thanks!!! Keywords: the Hungarian algorithm, Kuhn-Munkres algorithm, a modified Jonker-Volgenant algorithm. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. 
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #ifndef __c_scipy_rectangular_lsap_h #define __c_scipy_rectangular_lsap_h #include "c_common.h" #include #include #include Py_ssize_t __augmenting_path( Py_ssize_t nc, std::vector& cost, std::vector& u, std::vector& v, std::vector& path, std::vector& row4col, std::vector& shortestPathCosts, Py_ssize_t i, std::vector& SR, std::vector& SC, double* p_minVal); /** * Solves the 2D rectangular assignment problem * using the algorithm described in * * The procedure is adapted from * scipy/scipy/optimize/rectangular_lsap/rectangular_lsap.cpp * (version last updated on 5 Mar; c050fd9) * See https://github.com/scipy/scipy/ and https://scipy.org/scipylib/. * Author: P.M. Larsen. * * * * References * ========== * * Crouse D.F., On implementing 2D rectangular assignment algorithms, * *IEEE Transactions on Aerospace and Electronic Systems* **52**(4), 2016, * pp. 1679-1696, doi:10.1109/TAES.2016.140952. 
* * * @param C c_contiguous cost matrix; shape nr*nc * @param nr number of rows in C * @param nc number of cols in C, nc>=nr * @param output_col4row [output] c_contiguous vector of length nr; * (i, output_col4row[i]) gives location of the * nr items in C with the smallest sum. * @param minimise false if we seek the maximum * * @return 0 on success */ template Py_ssize_t linear_sum_assignment( T1* C, Py_ssize_t nr, Py_ssize_t nc, T2* output_col4row, bool minimise=true) { if (nr > nc) throw std::domain_error("nr > nc"); // build a non-negative cost matrix std::vector cost(nr * nc); if (minimise) { double minval = *std::min_element(C, C + nr * nc); for (Py_ssize_t i = 0; i < nr * nc; i++) { cost[i] = C[i] - minval; } } else { double maxval = *std::max_element(C, C + nr * nc); for (Py_ssize_t i = 0; i < nr * nc; i++) { cost[i] = maxval-C[i]; } } // initialize variables std::vector u(nr, 0); std::vector v(nc, 0); std::vector shortestPathCosts(nc); std::vector path(nc, -1); std::vector col4row(nr, -1); std::vector row4col(nc, -1); std::vector SR(nr); std::vector SC(nc); // iteratively build the solution for (Py_ssize_t curRow = 0; curRow < nr; curRow++) { double minVal; Py_ssize_t sink = __augmenting_path(nc, cost, u, v, path, row4col, shortestPathCosts, curRow, SR, SC, &minVal); if (sink < 0) { return -1; } // update dual variables u[curRow] += minVal; for (Py_ssize_t i = 0; i < nr; i++) { if (SR[i] && i != curRow) { u[i] += minVal - shortestPathCosts[col4row[i]]; } } for (Py_ssize_t j = 0; j < nc; j++) { if (SC[j]) { v[j] -= minVal - shortestPathCosts[j]; } } // augment previous solution Py_ssize_t j = sink; while (1) { Py_ssize_t i = path[j]; row4col[j] = i; std::swap(col4row[i], j); if (i == curRow) { break; } } } for (Py_ssize_t i = 0; i < nr; i++) { output_col4row[i] = (T2)col4row[i]; } return 0; } Py_ssize_t __augmenting_path( Py_ssize_t nc, std::vector& cost, std::vector& u, std::vector& v, std::vector& path, std::vector& row4col, std::vector& shortestPathCosts, 
Py_ssize_t i, std::vector& SR, std::vector& SC, double* p_minVal) { double minVal = 0; // Crouse's pseudocode uses set complements to keep track of remaining // nodes. Here we use a vector, as it is more efficient in C++. Py_ssize_t num_remaining = nc; std::vector remaining(nc); for (Py_ssize_t it = 0; it < nc; it++) { // Filling this up in reverse order ensures that the solution of a // constant cost matrix is the identity matrix (c.f. #11602). remaining[it] = nc - it - 1; } std::fill(SR.begin(), SR.end(), false); std::fill(SC.begin(), SC.end(), false); std::fill(shortestPathCosts.begin(), shortestPathCosts.end(), INFINITY); // find shortest augmenting path Py_ssize_t sink = -1; while (sink == -1) { Py_ssize_t index = -1; double lowest = INFINITY; SR[i] = true; for (Py_ssize_t it = 0; it < num_remaining; it++) { Py_ssize_t j = remaining[it]; double r = minVal + cost[i * nc + j] - u[i] - v[j]; if (r < shortestPathCosts[j]) { path[j] = i; shortestPathCosts[j] = r; } // When multiple nodes have the minimum cost, we select one which // gives us a new sink node. This is particularly important for // cost matrices with small coefficients. 
if (shortestPathCosts[j] < lowest || (shortestPathCosts[j] == lowest && row4col[j] == -1)) { lowest = shortestPathCosts[j]; index = it; } } minVal = lowest; Py_ssize_t j = remaining[index]; if (minVal == INFINITY) { // infeasible cost matrix return -1; } if (row4col[j] == -1) { sink = j; } else { i = row4col[j]; } SC[j] = true; remaining[index] = remaining[--num_remaining]; remaining.resize(num_remaining); } *p_minVal = minVal; return sink; } #endif genieclust/src/Makevars0000644000176200001440000000013314431101205014663 0ustar liggesusersPKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DGENIECLUST_R PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) genieclust/src/cvi_generalized_dunn_uppercase_d2.h0000644000176200001440000000730414634041051022170 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_GENERALIZED_DUNN_UPPERCASE_D2_H #define __CVI_GENERALIZED_DUNN_UPPERCASE_D2_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class UppercaseDelta2 : public UppercaseDelta { protected: std::vector dist_sums; ///< sum of points distances to centroid: std::vector last_dist_sums; ///< for undo() bool last_chg; ///< for undo() (was dist changed at all?) 
public: UppercaseDelta2( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : UppercaseDelta(D,X,L,count,K,n,d,centroids), dist_sums(K), last_dist_sums(K), last_chg(false) { } virtual void before_modify(size_t i, Py_ssize_t j) { for (size_t u=0; u& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new UppercaseDelta2(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/c_compare_partitions.h0000644000176200001440000004507114650457063017601 0ustar liggesusers/* External Cluster Validity Measures * * Adjusted- and Nonadjusted Rand Score, * Adjusted- and Nonadjusted Fowlkes-Mallows Score, * Adjusted-, Normalised and Nonadjusted Mutual Information Score, * Normalised Accuracy, Pair Sets Index * (for vectors of "small" ints) * * * References * ========== * * Hubert L., Arabie P., Comparing Partitions, * Journal of Classification 2(1), 1985, pp. 193-218, esp. Eqs. (2) and (4) * * Vinh N.X., Epps J., Bailey J., * Information theoretic measures for clusterings comparison: * Variants, properties, normalization and correction for chance, * Journal of Machine Learning Research 11, 2010, pp. 2837-2854. * * Rezaei M., Franti P., Set matching measures for external cluster validity, * IEEE Transactions on Knowledge and Data Mining 28(8), 2016, pp. 2173-2186, * doi:10.1109/TKDE.2016.2551240 * * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. 
* You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_compare_partitions_h #define __c_compare_partitions_h #include "c_common.h" #include "c_scipy_rectangular_lsap.h" #include #include #include /*! (t choose 2) * * @param t * @return t*(t-1.0)*0.5 */ inline double Ccomb2(double t) { return t*(t-1.0)*0.5; } /*! Computes both the minimum and the maximum in an array * * @param x c_contiguous array of length n * @param n length of x * @param xmin [out] the minimum of x * @param xmax [out] the maximum of x */ template void Cminmax(const T* x, Py_ssize_t n, T* xmin, T* xmax) { *xmin = x[0]; *xmax = x[0]; for (Py_ssize_t i=1; i *xmax) *xmax = x[i]; } } /*! * Stores AR and FM scores as well as their adjusted/normalised versions. */ struct CComparePartitionsPairsResult { double ar; double r; double fm; double afm; }; /*! * Stores mutual information-based scores */ struct CComparePartitionsInfoResult { double mi; double nmi; double ami; }; /*! * Stores normalised set-matching scores */ struct CCompareSetMatchingResult { double psi_unclipped; double spsi_unclipped; }; /*! Normalising permutation for the columns of a confusion matrix * * Determines the reordering of columns in a given confusion matrix * so that the sum of the elements on the main diagonal is the largest * possible (by solving the maximal assignment problem). * * Comes in handy if C summarises the results generated * by clustering algorithms, where actual label values do not matter * (e.g., (1, 2, 0) can be remapped to (0, 2, 1) with no change in meaning. * * * @param C a c_contiguous confusion matrix of size xc*yc * @param xc number of rows in C * @param yc number of columns in C; xc <= yc * @param Iout [out] output sequence of length yc * * Note that Iout is modified in-place (overwritten). 
*/ template void Cnormalizing_permutation( const T1* C, Py_ssize_t xc, Py_ssize_t yc, T2* Iout ) { GENIECLUST_ASSERT(xc <= yc); std::vector column_used(yc, false); Py_ssize_t retval = linear_sum_assignment( C, xc, yc, Iout, /*minimise*/false ); GENIECLUST_ASSERT(retval == 0); // only Iout[0]..Iout[xc-1] are set Py_ssize_t i; for (i=0; i void Capply_pivoting( const T* C, Py_ssize_t xc, Py_ssize_t yc, T* Cout/*, bool use_sum=false*/ ) { GENIECLUST_ASSERT(xc <= yc); // if (use_sum) { std::vector output_col4row(yc); Cnormalizing_permutation(C, xc, yc, /*retval*/output_col4row.data()); Py_ssize_t i; for (i=0; i void Ccontingency_table(O* Cout, Py_ssize_t xc, Py_ssize_t yc, T xmin, T ymin, const T* x, const T* y, Py_ssize_t n) { for (Py_ssize_t j=0; j (x[i]-xmin)*yc +(y[i]-ymin)); Cout[(x[i]-xmin)*yc +(y[i]-ymin)] += 1; } } /*! Computes the adjusted and nonadjusted Rand- and FM scores * based on a given confusion matrix. * * References * ========== * * Hubert L., Arabie P., Comparing Partitions, * Journal of Classification 2(1), 1985, pp. 193-218, esp. Eqs. 
(2) and (4) * * @param C a c_contiguous confusion matrix of size xc*yc * @param xc number of rows in C * @param yc number of columns in C * * @return the computed scores */ template CComparePartitionsPairsResult Ccompare_partitions_pairs(const T* C, Py_ssize_t xc, Py_ssize_t yc) { double n = 0.0; // total sum (length of the underlying x and y = number of points) for (Py_ssize_t ij=0; ij CComparePartitionsInfoResult Ccompare_partitions_info(const T* C, Py_ssize_t xc, Py_ssize_t yc) { double n = 0.0; // total sum (length of the underlying x and y = number of points) for (Py_ssize_t ij=0; ij sum_x(xc); std::vector sum_y(yc); double h_x = 0.0, h_y = 0.0, h_x_cond_y = 0.0, h_x_y = 0.0; for (Py_ssize_t i=0; i 0) h_x_y += C[i*yc+j]*std::log((double)C[i*yc+j]/(double)n); t += C[i*yc+j]; } sum_x[i] = t; if (t > 0) h_y += t*std::log((double)t/(double)n); } for (Py_ssize_t j=0; j 0) h_x_cond_y += C[i*yc+j]*std::log((double)C[i*yc+j]/sum_x[i]); t += C[i*yc+j]; } sum_y[j] = t; if (t > 0) h_x += t*std::log((double)t/(double)n); } h_x = -h_x/(double)n; h_y = -h_y/(double)n; h_x_cond_y = -h_x_cond_y/(double)n; h_x_y = -h_x_y/(double)n; double e_mi = 0.0; for (Py_ssize_t i=0; i double Ccompare_partitions_npa(const T* C, Py_ssize_t xc, Py_ssize_t yc) { double n = 0.0; // total sum (length of the underlying x and y = number of points) for (Py_ssize_t ij=0; ij 0) { n += C[ij]; } } // if C is not a square matrix, treat the missing columns // as if they were filled with 0s Py_ssize_t xyc = std::max(xc, yc); std::vector S(xyc*xyc, 0.0); for (Py_ssize_t i=0; i 0) { S[i*xyc+j] = (double)C[i*yc+j]; } } } std::vector output_col4row(xyc); Py_ssize_t retval = linear_sum_assignment(S.data(), xyc, xyc, output_col4row.data(), false); // minimise=false GENIECLUST_ASSERT(retval == 0); // sum of pivots: double t = 0.0; for (Py_ssize_t i=0; i double Ccompare_partitions_nca(const T* C, Py_ssize_t xc, Py_ssize_t yc) { std::vector sum_x(xc, 0.0); for (Py_ssize_t i=0; i 0) { sum_x[i] += C[i*yc+j]; } } } 
// if xc>yc, treat C as if its missing columns were filled with 0s Py_ssize_t yc2 = std::max(xc, yc); // if xc S(xc*yc2, 0.0); for (Py_ssize_t i=0; i 0) { S[i*yc2+j] = (double)C[i*yc+j]/(double)sum_x[i]; } } } std::vector output_col4row2(xc); Py_ssize_t retval = linear_sum_assignment(S.data(), xc, yc2, output_col4row2.data(), false); // minimise=false GENIECLUST_ASSERT(retval == 0); // sum of pivots double t = 0.0; for (Py_ssize_t i=0; i CCompareSetMatchingResult Ccompare_partitions_psi( const T* C, Py_ssize_t xc, Py_ssize_t yc ) { double n = 0.0; // total sum (length of the underlying x and y = number of points) for (Py_ssize_t ij=0; ij 0) { n += C[ij]; } } // If C is not a square matrix, treat the missing columns or rows // as if they were filled with 0s. Py_ssize_t xyc = std::max(xc, yc); std::vector sum_x(xyc, 0.0); std::vector sum_y(xyc, 0.0); for (Py_ssize_t i=0; i 0) { sum_x[i] += C[i*yc+j]; sum_y[j] += C[i*yc+j]; } } } std::vector S(xyc*xyc, 0.0); for (Py_ssize_t i=0; i 0) { S[i*xyc+j] = (double)C[i*yc+j]/(double)std::max(sum_x[i], sum_y[j]); } } } std::vector output_col4row2(xyc); Py_ssize_t retval = linear_sum_assignment(S.data(), xyc, xyc, output_col4row2.data(), false); // minimise=false GENIECLUST_ASSERT(retval == 0); // // sum of pivots: // double s = 0.0; // for (Py_ssize_t i=0; i pivots(xyc, 0.0); for (Py_ssize_t i=0; i sum_x[i]) es += sum_x[i]; else es += sum_y[i]; } es /= (double)n; CCompareSetMatchingResult res; // PSI uses max(0, PSI_unclipped) res.psi_unclipped = (s-es)/(xyc-es); res.spsi_unclipped = (s-1.0)/(xyc-1.0); return res; } #endif genieclust/src/cvi_calinski_harabasz.h0000644000176200001440000001147014634041051017666 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. 
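`Ccompare_partitions_pairs` implements, among others, the adjusted Rand index of Hubert and Arabie (1985), Eqs. (2) and (4). The same formula can be sketched directly in Python from a confusion (contingency) matrix — an illustrative re-derivation, not the package API:

```python
from math import comb

def adjusted_rand(C):
    """Adjusted Rand index computed from a confusion matrix C
    (list of rows); 1.0 for identical partitions, ~0.0 for random ones."""
    n = sum(sum(row) for row in C)
    a = sum(comb(cij, 2) for row in C for cij in row)      # agreeing pairs
    rows = sum(comb(sum(row), 2) for row in C)             # row marginals
    cols = sum(comb(s, 2) for s in (sum(c) for c in zip(*C)))
    expected = rows * cols / comb(n, 2)                    # chance agreement
    return (a - expected) / ((rows + cols) / 2 - expected)

# x = (0,0,0,1,1,1) vs y = (0,0,1,1,1,1) yields C = [[2, 1], [0, 3]]
ari = adjusted_rand([[2, 1], [0, 3]])
```

The adjustment subtracts the agreement expected under the permutation model, which is why a random relabelling scores near zero rather than near the raw Rand value.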
* * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_CALINSKI_HARABASZ_H #define __CVI_CALINSKI_HARABASZ_H #include "cvi.h" /** The Calinski-Harabasz Index (Variance Ratio Criterion) * * See Eq.(3) in (Calinski, Harabasz, 1974) * * T. Calinski, J. Harabasz, A dendrite method for cluster analysis, * Communications in Statistics, 3(1), 1974, pp. 1-27, * doi:10.1080/03610927408827101. * * See the following paper for the formula and further discussion: * M. Gagolewski, M. Bartoszuk, A. 
Cena, * Are cluster validity measures (in)valid?, Information Sciences 581, * 620-636, 2021, DOI:10.1016/j.ins.2021.10.004 */ class CalinskiHarabaszIndex : public CentroidsBasedIndex { protected: std::vector centroid; ///< the centroid of the whole X, size d FLOAT_T numerator; ///< sum of intra-cluster squared L2 distances FLOAT_T denominator; ///< sum of within-cluster squared L2 distances FLOAT_T last_numerator; ///< for undo() FLOAT_T last_denominator; ///< for undo() public: // Described in the base class CalinskiHarabaszIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo=false) : CentroidsBasedIndex(_X, _K, _allow_undo), centroid(d, 0.0) { // centroid[i,j] == 0.0 already // compute the centroid of the whole dataset for (size_t i=0; i& _L) { CentroidsBasedIndex::set_labels(_L); // sets L, count and centroids // sum of intra-cluster squared L2 distances numerator = 0.0; for (size_t i=0; i * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_distance_h #define __c_distance_h #include "c_common.h" #include #include #ifdef _OPENMP #include #endif template inline T square(T x) { return x*x; } /*! Abstract base class for all distances */ template struct CDistance { virtual ~CDistance() {} /*! * @param i point index, 0<=i struct CDistancePrecomputedMatrix : public CDistance { const T* dist; Py_ssize_t n; /*! 
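`CalinskiHarabaszIndex` maintains the between- and within-cluster sums of squares incrementally so that single-point moves are cheap. The closed form it tracks, CH = (B/(K−1)) / (W/(n−K)) with B = T − W, can be checked against a direct (non-incremental) Python sketch:

```python
def calinski_harabasz(X, labels, K):
    """Variance ratio criterion for points X (tuples) and 0-based labels."""
    n, d = len(X), len(X[0])
    mean = [sum(p[j] for p in X) / n for j in range(d)]
    cent = [[0.0] * d for _ in range(K)]
    cnt = [0] * K
    for p, l in zip(X, labels):
        cnt[l] += 1
        for j in range(d):
            cent[l][j] += p[j]
    for l in range(K):
        cent[l] = [c / cnt[l] for c in cent[l]]
    W = sum(sum((p[j] - cent[l][j]) ** 2 for j in range(d))
            for p, l in zip(X, labels))                      # within-cluster SS
    T = sum(sum((p[j] - mean[j]) ** 2 for j in range(d)) for p in X)
    B = T - W                                                # between-cluster SS
    return (B / (K - 1)) / (W / (n - K))

ch = calinski_harabasz([(0, 0), (0, 1), (5, 5), (5, 6)], [0, 0, 1, 1], 2)
```

For the two tight, well-separated pairs above, W = 1 and B = 50, so the index is large — exactly the behaviour a variance-ratio criterion should reward.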
* @param dist n*n c_contiguous array, dist[i,j] is the distance between * the i-th and the j-th point, the matrix is symmetric * @param n number of points */ CDistancePrecomputedMatrix(const T* dist, Py_ssize_t n) { this->n = n; this->dist = dist; } CDistancePrecomputedMatrix() : CDistancePrecomputedMatrix(NULL, 0) { } virtual const T* operator()(Py_ssize_t i, const Py_ssize_t* /*M*/, Py_ssize_t /*k*/) { return &this->dist[i*n]; // the i-th row of dist } }; /*! A class to "compute" the distances from the i-th point * to all n points based on a pre-computed a vector-form * c_contiguous distance vector. */ template struct CDistancePrecomputedVector : public CDistance { const T* dist; Py_ssize_t n; std::vector buf; /*! * @param dist n*(n-1)/2 c_contiguous vector, dist[ i*n - i*(i+1)/2 + w-i-1 ] * where w is the distance between the i-th and the w-th point * @param n number of points */ CDistancePrecomputedVector(const T* dist, Py_ssize_t n) : buf(n) { this->n = n; this->dist = dist; } CDistancePrecomputedVector() : CDistancePrecomputedVector(NULL, 0) { } virtual const T* operator()(Py_ssize_t i, const Py_ssize_t* M, Py_ssize_t k) { T* __buf = buf.data(); for (Py_ssize_t j=0; j struct CDistanceEuclidean : public CDistance { const T* X; Py_ssize_t n; Py_ssize_t d; std::vector buf; /*! * @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceEuclidean(const T* X, Py_ssize_t n, Py_ssize_t d) : buf(n) { this->n = n; this->d = d; this->X = X; } CDistanceEuclidean() : CDistanceEuclidean(NULL, 0, 0) { } virtual const T* operator()(Py_ssize_t i, const Py_ssize_t* M, Py_ssize_t k) { T* __buf = buf.data(); const T* x = X+d*i; #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (Py_ssize_t j=0; j struct CDistanceEuclideanSquared : public CDistance { const T* X; Py_ssize_t n; Py_ssize_t d; std::vector buf; /*! 
* @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceEuclideanSquared(const T* X, Py_ssize_t n, Py_ssize_t d) : buf(n) { this->n = n; this->d = d; this->X = X; } CDistanceEuclideanSquared() : CDistanceEuclideanSquared(NULL, 0, 0) { } virtual const T* operator()(Py_ssize_t i, const Py_ssize_t* M, Py_ssize_t k) { T* __buf = buf.data(); const T* x = X+d*i; #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (Py_ssize_t j=0; j struct CDistanceManhattan : public CDistance { const T* X; Py_ssize_t n; Py_ssize_t d; std::vector buf; /*! * @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceManhattan(const T* X, Py_ssize_t n, Py_ssize_t d) : buf(n) { this->n = n; this->d = d; this->X = X; } CDistanceManhattan() : CDistanceManhattan(NULL, 0, 0) { } virtual const T* operator()(Py_ssize_t i, const Py_ssize_t* M, Py_ssize_t k) { T* __buf = buf.data(); #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (Py_ssize_t j=0; j=0 && w struct CDistanceCosine : public CDistance { const T* X; Py_ssize_t n; Py_ssize_t d; std::vector buf; std::vector norm; /*! 
* @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceCosine(const T* X, Py_ssize_t n, Py_ssize_t d) : buf(n), norm(n) { this->n = n; this->d = d; this->X = X; T* __norm = norm.data(); #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (Py_ssize_t i=0; i=0&&w struct CDistanceMutualReachability : public CDistance { Py_ssize_t n; CDistance* d_pairwise; std::vector buf; std::vector d_core; CDistanceMutualReachability(const T* _d_core, Py_ssize_t n, CDistance* d_pairwise) : buf(n), d_core(_d_core, _d_core+n) { this->n = n; this->d_pairwise = d_pairwise; } CDistanceMutualReachability() : CDistanceMutualReachability(NULL, 0, NULL) { } virtual const T* operator()(Py_ssize_t i, const Py_ssize_t* M, Py_ssize_t k) { // pragma omp parallel for inside:: const T* d = (*d_pairwise)(i, M, k); T* __buf = buf.data(); #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (Py_ssize_t j=0; j __buf[w]) __buf[w] = d_core[i]; if (d_core[w] > __buf[w]) __buf[w] = d_core[w]; } } return __buf; } }; #endif genieclust/src/cvi_generalized_dunn.h0000644000176200001440000001456214634041051017540 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
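`CDistanceMutualReachability` wraps any base distance `d` with per-point core distances, yielding the mutual reachability distance d_M(i,j) = max(d_core(i), d_core(j), d(i,j)) used when the Genie algorithm is run with M ≥ 2 (as in HDBSCAN*-style noise smoothing). A one-line sketch of the definition:

```python
def mutual_reachability(d_ij, core_i, core_j):
    """Mutual reachability distance: the base distance clipped from below
    by both endpoints' core distances."""
    return max(core_i, core_j, d_ij)

# a faraway pair keeps its base distance; a close pair is pushed up to the
# larger of the two core distances
far = mutual_reachability(10.0, 2.0, 1.0)
near = mutual_reachability(0.5, 2.0, 1.0)
```

Clipping short distances this way makes points in sparse regions "repel" each other, so MST edges through noise become long and easy to cut.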
*/ #ifndef __CVI_GENERALIZED_DUNN_H #define __CVI_GENERALIZED_DUNN_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" /** Dunn's index for measuring the degree to which clusters are * compact and well-separated * * The index is defined by Eq.(3) in (Dunn, 1974). * * J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting * Compact Well-Separated Clusters, Journal of Cybernetics 3(3), 1974, * pp. 32-57, doi:10.1080/01969727308546046. * * See the following paper for the formula and further discussion: * M. Gagolewski, M. Bartoszuk, A. Cena, * Are cluster validity measures (in)valid?, Information Sciences 581, * 620-636, 2021, DOI:10.1016/j.ins.2021.10.004 */ class GeneralizedDunnIndex : public ClusterValidityIndex { protected: EuclideanDistance D; ///< squared Euclidean LowercaseDelta* numeratorDelta; UppercaseDelta* denominatorDelta; public: // Described in the base class GeneralizedDunnIndex( const CMatrix& _X, const size_t _K, LowercaseDeltaFactory* numeratorDeltaFactory, UppercaseDeltaFactory* denominatorDeltaFactory, const bool _allow_undo=false) : ClusterValidityIndex(_X, _K, _allow_undo), D(&X, n<=CVI_MAX_N_PRECOMPUTE_DISTANCE, true/*squared*/), numeratorDelta(numeratorDeltaFactory->create(D, X, L, count, K, n, d)), denominatorDelta(denominatorDeltaFactory->create(D, X, L, count, K, n, d)) { } ~GeneralizedDunnIndex() { delete numeratorDelta; delete denominatorDelta; } // Described in the base class virtual void set_labels(const std::vector& _L) { ClusterValidityIndex::set_labels(_L); // sets L, count and centroids numeratorDelta->recompute_all(); denominatorDelta->recompute_all(); } // Described in the base class virtual void modify(size_t i, Py_ssize_t j) { numeratorDelta->before_modify(i, j); denominatorDelta->before_modify(i, j); // sets L[i]=j and updates count as well as centroids ClusterValidityIndex::modify(i, j); numeratorDelta->after_modify(i, j); denominatorDelta->after_modify(i, j); } // Described in the base class 
virtual void undo() { numeratorDelta->undo(); denominatorDelta->undo(); ClusterValidityIndex::undo(); } // Described in the base class virtual FLOAT_T compute() { FLOAT_T max_denominator = 0.0; FLOAT_T min_numerator = INFTY; for (size_t i=0; icompute(i); if (denom_i > max_denominator) max_denominator = denom_i; for (size_t j=i+1; jcompute(i, j); if (num_ij < min_numerator) min_numerator = num_ij; } } // remember to do sqrt in deltas! return min_numerator/max_denominator; } }; class GeneralizedDunnIndexCentroidBased : public CentroidsBasedIndex { protected: EuclideanDistance D; ///< squared Euclidean LowercaseDelta* numeratorDelta; UppercaseDelta* denominatorDelta; public: // Described in the base class GeneralizedDunnIndexCentroidBased( const CMatrix& _X, const size_t _K, LowercaseDeltaFactory* numeratorDeltaFactory, UppercaseDeltaFactory* denominatorDeltaFactory, const bool _allow_undo=false) : CentroidsBasedIndex(_X, _K, _allow_undo), D(&X, n<=CVI_MAX_N_PRECOMPUTE_DISTANCE, true/*squared*/), numeratorDelta(numeratorDeltaFactory->create(D, X, L, count, K, n, d, ¢roids)), denominatorDelta(denominatorDeltaFactory->create(D, X, L, count, K, n, d, ¢roids)) { } ~GeneralizedDunnIndexCentroidBased() { delete numeratorDelta; delete denominatorDelta; } // Described in the base class virtual void set_labels(const std::vector& _L) { CentroidsBasedIndex::set_labels(_L); // sets L, count and centroids numeratorDelta->recompute_all(); denominatorDelta->recompute_all(); } // Described in the base class virtual void modify(size_t i, Py_ssize_t j) { numeratorDelta->before_modify(i, j); denominatorDelta->before_modify(i, j); // sets L[i]=j and updates count as well as centroids CentroidsBasedIndex::modify(i, j); numeratorDelta->after_modify(i, j); denominatorDelta->after_modify(i, j); } // Described in the base class virtual void undo() { numeratorDelta->undo(); denominatorDelta->undo(); CentroidsBasedIndex::undo(); } // Described in the base class virtual FLOAT_T compute() { 
FLOAT_T max_denominator = 0.0; FLOAT_T min_numerator = INFTY; for (size_t i=0; icompute(i); if (denom_i > max_denominator) max_denominator = denom_i; for (size_t j=i+1; jcompute(i, j); if (num_ij < min_numerator) min_numerator = num_ij; } } // remember to do sqrt in deltas! return min_numerator/max_denominator; } }; #endif genieclust/src/c_int_dict.h0000644000176200001440000002545014634041214015460 0ustar liggesusers/* class CIntDict * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_int_dict_h #define __c_int_dict_h #include "c_common.h" #include #include #include /*! ordered_map (dictionary) for keys in {0,1,...,n-1} (small ints). * Elements are stored in the natural 0 <= 1 <= ... <= n-1 order. * * Most supported operations (except for inserting a new key * in the "middle") are as fast as you can get. * Yet, everything comes at a price: here it's the O(n) memory * requirement, even if data are sparse. * * Use case: GiniDisjointSets in the `genieclust` Python package. */ template class CIntDict { protected: Py_ssize_t n; //!< total number of distinct keys possible Py_ssize_t k; //!< number of keys currently stored std::vector tab; //!< tab[i] is the element associated with key i std::vector tab_next; //!< an array-based... std::vector tab_prev; //!< ...doubly-linked list... Py_ssize_t tab_head; //!< ...for quickly accessing and iterating over... 
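`GeneralizedDunnIndex::compute()` above takes the minimum numerator delta over cluster pairs divided by the maximum denominator delta over clusters. With the classic choices — minimum between-cluster point distance in the numerator and cluster diameter in the denominator — this reduces to Dunn's original index, sketched directly here (illustrative, O(n²), not the incremental package implementation):

```python
from itertools import combinations

def dunn(X, labels):
    """Classic Dunn index: min inter-cluster distance / max cluster diameter."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    min_between = min(dist(p, q)
                      for (p, lp), (q, lq) in combinations(zip(X, labels), 2)
                      if lp != lq)
    max_diam = max(dist(p, q)
                   for (p, lp), (q, lq) in combinations(zip(X, labels), 2)
                   if lp == lq)
    return min_between / max_diam

d = dunn([(0, 0), (0, 1), (5, 5), (5, 6)], [0, 0, 1, 1])
```

Both diameters equal 1 and the closest inter-cluster pair is √41 apart, so the index is √41 — large values indicate compact, well-separated clusters.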
Py_ssize_t tab_tail; //!< ...this->tab data public: /*! Constructs an empty container. * * @param n number of elements, n>=0. */ CIntDict(Py_ssize_t n) : tab(n), tab_next(n, n), tab_prev(n, -1) { // if (n < 0) throw std::domain_error("n < 0"); this->n = n; this->k = 0; this->tab_head = n; this->tab_tail = -1; } /*! Constructs a full-size container. * * @param n number of elements, n>=0. * @param val value to replicate at each position */ CIntDict(Py_ssize_t n, const T& val) : tab(n), tab_next(n), tab_prev(n) { // if (n < 0) throw std::domain_error("n < 0"); this->n = n; this->k = n; for (Py_ssize_t i=0; itab_prev[i] = i-1; this->tab_next[i] = i+1; this->tab = val; } this->tab_head = 0; this->tab_tail = n-1; } /*! A nullary constructor allows Cython to allocate * the instances on the stack. Do not use otherwise. */ CIntDict() : CIntDict(0) { } /*! Returns the current number of elements in the container. * * Time complexity: O(1) */ inline Py_ssize_t size() const { return this->k; } /*! Returns the maximum number of elements that the container can hold. */ inline Py_ssize_t max_size() const { return this->n; } /*! Tests whether the container is empty, i.e., its size() is 0. */ inline bool empty() const { return this->k == 0; } /*! Counts the number of elements with given key, i.e., returns 0 or 1 * depending on whether an element with key i exists. * * Time complexity: O(1) * * @param i key in [0,n) */ inline size_t count(Py_ssize_t i) const { if (i < 0 || i >= n) throw std::out_of_range("CIntDict::count key out of range"); return (tab_prev[i]>=0 || tab_next[i] new head and tail tab_head = tab_tail = i; } else if (i < tab_head) { // new head tab_next[i] = tab_head; GENIECLUST_ASSERT(tab_prev[i] == -1); tab_prev[tab_head] = i; tab_head = i; } else if (i > tab_tail) { // new tail tab_next[tab_tail] = i; tab_prev[i] = tab_tail; GENIECLUST_ASSERT(tab_next[i] == n); tab_tail = i; } else { // insert in the "middle" // slow op // TODO skip list, etc. ?? 
Py_ssize_t elem_before_i = tab_head; while (tab_next[elem_before_i] < i) elem_before_i = tab_next[elem_before_i]; Py_ssize_t elem_after_i = tab_next[elem_before_i]; GENIECLUST_ASSERT(tab_prev[elem_after_i] == elem_before_i); tab_next[i] = elem_after_i; tab_prev[i] = elem_before_i; tab_next[elem_before_i] = i; tab_prev[elem_after_i] = i; } k++; // we have a brand new elem in the storage } return tab[i]; } /*! Removes a single element, provided it exists. * * Time complexity: O(1) * * @param i key in [0,n) * @return the number of elements removed (0 or 1) */ Py_ssize_t erase(Py_ssize_t i) { if (!count(i)) return 0; if (i == tab_head && i == tab_tail) { // that was the last (size-wise) element in the container tab_head = n; tab_tail = -1; } else if (i == tab_head) { // that was the least element tab_head = tab_next[tab_head]; tab_prev[tab_head] = -1; } else if (i == tab_tail) { // that was the largest one tab_tail = tab_prev[tab_tail]; tab_next[tab_tail] = n; } else { // elem in the "middle" Py_ssize_t elem_after_i = tab_next[i]; Py_ssize_t elem_before_i = tab_prev[i]; tab_next[elem_before_i] = elem_after_i; tab_prev[elem_after_i] = elem_before_i; } tab[i] = T(); // force destructor call tab_prev[i] = -1; tab_next[i] = n; k--; return 1; // one element has been removed } Py_ssize_t get_key_min() const { return tab_head; } Py_ssize_t get_key_max() const { return tab_tail; } Py_ssize_t get_key_next(Py_ssize_t i) const { return tab_next[i]; } Py_ssize_t get_key_prev(Py_ssize_t i) const { return tab_prev[i]; } Py_ssize_t pop_key_min() { Py_ssize_t ret = tab_head; erase(ret); return ret; } Py_ssize_t pop_key_max() { Py_ssize_t ret = tab_tail; erase(ret); return ret; } // ------- minimal iterator-based interface ----------------------------- /*! 
If you want more than merely an input_iterator, * go ahead, implement it and make a pull request :) */ class iterator //: public std::iterator { public: using iterator_category = std::input_iterator_tag; using value_type = Py_ssize_t; // crap using difference_type = void; using pointer = void; using reference = void; private: const Py_ssize_t* tab_next; Py_ssize_t cur; public: iterator(Py_ssize_t tab_head, Py_ssize_t* tab_next) : tab_next(tab_next), cur(tab_head) { } iterator& operator++() { cur = tab_next[cur]; return *this; } iterator operator++(int) { iterator tmp(*this); operator++(); return tmp; } bool operator==(const iterator& rhs) const { return tab_next==rhs.tab_next && cur==rhs.cur; } bool operator!=(const iterator& rhs) const { return tab_next!=rhs.tab_next || cur!=rhs.cur; } Py_ssize_t operator*() const { return cur; } }; /*! Returns an iterator pointing to the element in the container * that has the least key */ iterator begin() { return iterator(tab_head, tab_next.data()); } /*! 
Returns an iterator pointing to the past-the-end element */ iterator end() { return iterator(n, tab_next.data()); } // TODO /go ahead, write it, make a pull request/ //Returns an iterator pointing to the element in the container // that has the greatest key // reverse_iterator rbegin() // TODO /go ahead, write it, make a pull request/ //Returns an iterator pointing to the past-the-beginning element // reverse_iterator rend() // TODO /go ahead, write it, make a pull request/ // cbegin, cend, crbegin, crend() // TODO /go ahead, write it, make a pull request/ // Returns an iterator to an element with given key or returns an iterator to end() if not exists // iterator find ( const key_type& k ); //const_iterator find ( const key_type& k ) const; // TODO /go ahead, write it, make a pull request/ // Removes a single element from the container //iterator erase ( const_iterator position ); //by position (1) }; #endif genieclust/src/c_preprocess.h0000644000176200001440000000400214634041214016036 0ustar liggesusers/* Graph pre-processing and other routines * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_preprocess_h #define __c_preprocess_h #include "c_common.h" #include #include #include #include #include "c_gini_disjoint_sets.h" #include "c_int_dict.h" /*! 
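`CIntDict` stores keys from {0, …, n−1} in natural order using array-backed prev/next links, trading O(n) memory for O(1) membership tests, head/tail access, and erase (only middle insertion is O(n)). A simplified Python model of the same structure (illustrative, assuming the same head/tail/middle cases as the C++ `operator[]` and `erase`):

```python
class IntDict:
    """Array-backed ordered set of keys in {0, ..., n-1} with O(1) erase."""
    def __init__(self, n):
        self.n = n
        self.nxt = [n] * n    # n  plays the role of "past the end"
        self.prv = [-1] * n   # -1 plays the role of "before the beginning"
        self.head, self.tail = n, -1

    def contains(self, i):
        return self.prv[i] >= 0 or self.nxt[i] < self.n or self.head == i

    def insert(self, i):
        if self.contains(i):
            return
        if self.head == self.n:                 # container was empty
            self.head = self.tail = i
        elif i < self.head:                     # new smallest key
            self.nxt[i], self.prv[self.head] = self.head, i
            self.head = i
        elif i > self.tail:                     # new largest key
            self.nxt[self.tail], self.prv[i] = i, self.tail
            self.tail = i
        else:                                   # O(n) middle insertion
            b = self.head
            while self.nxt[b] < i:
                b = self.nxt[b]
            a = self.nxt[b]
            self.nxt[b], self.prv[i], self.nxt[i], self.prv[a] = i, b, a, i

    def erase(self, i):
        if not self.contains(i):
            return
        b, a = self.prv[i], self.nxt[i]
        if b >= 0:
            self.nxt[b] = a
        else:
            self.head = a if a < self.n else self.n
        if a < self.n:
            self.prv[a] = b
        else:
            self.tail = b                       # -1 when container empties
        self.prv[i], self.nxt[i] = -1, self.n

    def keys(self):
        out, c = [], self.head
        while c < self.n:
            out.append(c)
            c = self.nxt[c]
        return out

d = IntDict(8)
for k in (5, 1, 3):
    d.insert(k)
d.erase(3)
```

Iteration simply follows the `nxt` chain from `head`, which is why keys always come out in increasing order without any sorting.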
Compute the degree of each vertex in an undirected graph * over vertex set {0,...,n-1} * * @param ind c_contiguous matrix of size num_edges*2, * where {ind[i,0], ind[i,1]} is the i-th edge * with ind[i,j] < n. * Edges with ind[i,0] < 0 or ind[i,1] < 0 are purposely ignored. * @param num_edges number of edges (rows in ind) * @param n number of vertices * @param deg [out] array of size n, where * deg[i] will give the degree of the i-th vertex. */ void Cget_graph_node_degrees( const Py_ssize_t* ind, Py_ssize_t num_edges, Py_ssize_t n, Py_ssize_t* deg) { for (Py_ssize_t i=0; i ignore if (u>=n || v>=n) throw std::domain_error("All elements must be <= n"); if (u == v) throw std::domain_error("Self-loops are not allowed"); deg[u]++; deg[v]++; } } #endif genieclust/src/r_gclust.cpp0000644000176200001440000002612314634041214015534 0ustar liggesusers/* The Genie++ Clustering Algorithm - R Wrapper * * Copyleft (C) 2018-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
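`Cget_graph_node_degrees` counts both endpoints of each edge, deliberately skipping edges with a negative index (used to mark missing neighbours) and rejecting self-loops and out-of-range vertices. An equivalent Python sketch:

```python
def graph_node_degrees(edges, n):
    """Degree of each vertex in {0, ..., n-1}; edges with a negative
    endpoint are silently ignored."""
    deg = [0] * n
    for u, v in edges:
        if u < 0 or v < 0:
            continue                    # purposely ignored, as in the C++
        if u >= n or v >= n:
            raise ValueError("all elements must be < n")
        if u == v:
            raise ValueError("self-loops are not allowed")
        deg[u] += 1
        deg[v] += 1
    return deg

# edge (-1, 3) is skipped, so vertex 3 keeps degree 0
deg = graph_node_degrees([(0, 1), (1, 2), (-1, 3), (2, 0)], 4)
```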
*/ #include "c_common.h" #include "c_matrix.h" #include "c_distance.h" #include "c_mst.h" #include "c_genie.h" #include "c_postprocess.h" #include using namespace Rcpp; /* This function was originally part of our `genie` package for R */ void internal_generate_merge(Py_ssize_t n, NumericMatrix links, NumericMatrix merge) { std::vector elements(n+1, 0); std::vector parents(n+1, 0); Py_ssize_t clusterNumber = 1; for (Py_ssize_t k=0; k merge(k, 1)) std::swap(merge(k, 0), merge(k, 1)); } } } /* Originally, this function was part of our `genie` package for R */ void internal_generate_order(Py_ssize_t n, NumericMatrix merge, NumericVector order) { std::vector< std::list > relord(n+1); Py_ssize_t clusterNumber = 1; for (Py_ssize_t k=0; k::iterator it = relord[n-1].begin(); it != relord[n-1].end(); ++it) { order[k++] = (*it); } } template NumericMatrix internal_compute_mst(CDistance* D, Py_ssize_t n, Py_ssize_t M, bool verbose) { if (M < 1 || M >= n-1) stop("`M` must be an integer in [1, n-1)"); NumericMatrix ret(n-1, 3); CDistance* D2 = NULL; if (M >= 2) { // yep, we need it for M==2 as well if (verbose) GENIECLUST_PRINT("[genieclust] Determining the core distance.\n"); Py_ssize_t k = M-1; CMatrix nn_i(n, k); CMatrix nn_d(n, k); Cknn_from_complete(D, n, k, nn_d.data(), nn_i.data()); NumericMatrix nn_r(n, k); std::vector d_core(n); for (Py_ssize_t i=0; i(d_core.data(), n, D); } CMatrix mst_i(n-1, 2); std::vector mst_d(n-1); if (verbose) GENIECLUST_PRINT("[genieclust] Computing the MST.\n"); Cmst_from_complete(D2?D2:D, n, mst_d.data(), mst_i.data(), verbose); if (verbose) GENIECLUST_PRINT("[genieclust] Done.\n"); if (D2) delete D2; for (Py_ssize_t i=0; i NumericMatrix internal_mst_default( NumericMatrix X, String distance, Py_ssize_t M, /*bool use_mlpack, */ bool verbose) { Py_ssize_t n = X.nrow(); Py_ssize_t d = X.ncol(); NumericMatrix ret; CMatrix X2(REAL(SEXP(X)), n, d, false); // Fortran- to C-contiguous for (Py_ssize_t i=0; i* D = NULL; if (distance == "euclidean" || 
distance == "l2") D = (CDistance*)(new CDistanceEuclideanSquared(X2.data(), n, d)); else if (distance == "manhattan" || distance == "cityblock" || distance == "l1") D = (CDistance*)(new CDistanceManhattan(X2.data(), n, d)); else if (distance == "cosine") D = (CDistance*)(new CDistanceCosine(X2.data(), n, d)); else stop("given `distance` is not supported (yet)"); ret = internal_compute_mst(D, n, M, verbose); delete D; if (distance == "euclidean" || distance == "l2") { for (Py_ssize_t i=0; i(X, distance, M, verbose); else return internal_mst_default(X, distance, M, verbose); } // [[Rcpp::export(".mst.dist")]] NumericMatrix dot_mst_dist( NumericVector d, int M=1, bool verbose=false) { Py_ssize_t n = (Py_ssize_t)round((sqrt(1.0+8.0*d.size())+1.0)/2.0); GENIECLUST_ASSERT(n*(n-1)/2 == d.size()); CDistancePrecomputedVector D(REAL(SEXP(d)), n); return internal_compute_mst(&D, n, M, verbose); } // [[Rcpp::export(".genie")]] IntegerVector dot_genie( NumericMatrix mst, int k, double gini_threshold, String postprocess, bool detect_noise, bool verbose) { if (verbose) GENIECLUST_PRINT("[genieclust] Determining clusters.\n"); if (gini_threshold < 0.0 || gini_threshold > 1.0) stop("`gini_threshold` must be in [0, 1]"); if (postprocess == "boundary" && detect_noise && Rf_isNull(mst.attr("nn"))) stop("`nn` attribute of the MST not set; unable to proceed with this postprocessing action"); Py_ssize_t n = mst.nrow()+1; if (k < 1 || k > n) stop("invalid requested number of clusters, `k`"); CMatrix mst_i(n-1, 2); std::vector mst_d(n-1); for (Py_ssize_t i=0; i g(mst_d.data(), mst_i.data(), n, detect_noise); g.apply_genie(k, gini_threshold); if (verbose) GENIECLUST_PRINT("[genieclust] Postprocessing the outputs.\n"); std::vector xres(n); Py_ssize_t k_detected = g.get_labels(k, xres.data()); if (k_detected != k) Rf_warning("Number of clusters detected is different than the requested one due to the presence of noise points."); if (detect_noise && postprocess == "boundary") { NumericMatrix 
nn_r = mst.attr("nn"); GENIECLUST_ASSERT(nn_r.nrow() == n); Py_ssize_t M = nn_r.ncol()+1; GENIECLUST_ASSERT(M < n); CMatrix nn_i(n, M-1); for (Py_ssize_t i=0; i= 1); GENIECLUST_ASSERT(nn_r(i,j) <= n); nn_i(i,j) = (Py_ssize_t)nn_r(i,j)-1; // 0-based indexing } } Cmerge_boundary_points(mst_i.data(), n-1, nn_i.data(), M-1, M, xres.data(), n); } else if (detect_noise && postprocess == "all") { Cmerge_noise_points(mst_i.data(), n-1, xres.data(), n); } IntegerVector res(n); for (Py_ssize_t i=0; i 1.0) stop("`gini_threshold` must be in [0, 1]"); Py_ssize_t n = mst.nrow()+1; CMatrix mst_i(n-1, 2); std::vector mst_d(n-1); for (Py_ssize_t i=0; i g(mst_d.data(), mst_i.data(), n/*, noise_leaves=M>1*/); g.apply_genie(1, gini_threshold); if (verbose) GENIECLUST_PRINT("[genieclust] Postprocessing the outputs.\n"); std::vector links(n-1); g.get_links(links.data()); NumericMatrix links2(n-1, 2); NumericVector height(n-1, NA_REAL); Py_ssize_t k = 0; for (Py_ssize_t i=0; i= 0) { links2(k, 0) = mst_i(links[i], 0) + 1; links2(k, 1) = mst_i(links[i], 1) + 1; height(k) = mst_d[ links[i] ]; ++k; } } for (; k * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_gini_disjoint_sets_h #define __c_gini_disjoint_sets_h #include "c_common.h" #include "c_disjoint_sets.h" #include "c_int_dict.h" /*! "Augmented" Disjoint Sets (Union-Find) Data Structure * * A class to represent partitions of the set {0,1,...,n-1} for any n. 
 *
 * Stores the size of every set in the partition.
 */
class CCountDisjointSets : public CDisjointSets
{
protected:
    std::vector<Py_ssize_t> cnt;  //!< cnt[find(x)] is the size of the relevant subset

public:
    /*! Starts with a "weak" partition { {0}, {1}, ..., {n-1} },
     * i.e., n singletons.
     *
     * @param n number of elements, n>=0.
     */
    CCountDisjointSets(Py_ssize_t n) :
        CDisjointSets(n),
        cnt(n, 1)  // each cluster is of size 1
    {
        ;
    }

    /*! A nullary constructor allows Cython to allocate
     * the instances on the stack. Do not use otherwise.
     */
    CCountDisjointSets() : CCountDisjointSets(0) { }

    /*! Returns the size of the subset containing x.
     *
     * Run time: the cost of find(x)
     */
    Py_ssize_t get_count(Py_ssize_t x) {
        x = this->find(x);
        return this->cnt[x];
    }

    /*! Merges the sets containing x and y in {0,...,n-1}.
     *
     * Let px be the parent id of x and py be the parent id of y.
     * If px < py, then the new parent id of py will be set to px.
     * Otherwise, px will have py as its parent.
     *
     * If x and y are members of the same subset,
     * an exception is thrown.
     *
     * @return the id of the parent of x or y, whichever is smaller.
     *
     * @param x a value in {0,...,n-1}
     * @param y a value in {0,...,n-1}
     *
     * Update time: amortised O(1).
     */
    virtual Py_ssize_t merge(Py_ssize_t x, Py_ssize_t y)
    {   // well, union is a reserved C++ keyword :)
        x = this->find(x);  // includes a range check for x
        y = this->find(y);  // includes a range check for y
        if (x == y) throw std::invalid_argument("find(x) == find(y)");
        if (y < x) std::swap(x, y);

        // CDisjointSets' merge part:
        this->par[y] = x;  // update the parent of y
        this->k -= 1;      // decrease the subset count

        // update the counts:
        this->cnt[x] += this->cnt[y];  // cluster x has more elements now
        this->cnt[y] = 0;              // cluster y, well, cleaning up

        return x;
    }
};


/*! "Augmented" Disjoint Sets (Union-Find) Data Structure
 *
 * A class to represent partitions of the set {0,1,...,n-1} for any n.
 *
 * The class allows one to compute the normalised Gini index of the
 * distribution of subset sizes, i.e.,
 * \[
 *     G(x_1,\dots,x_k) = \frac{
 *         \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |x_i-x_j|
 *     }{
 *         (k-1) \sum_{i=1}^{k} x_i
 *     }.
 * \]
 *
 * The merge() operation, which also recomputes the Gini index,
 * has O(sqrt(n)) time complexity.
 *
 * For a use case, see: Gagolewski M., Bartoszuk M., Cena A.,
 * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm,
 * Information Sciences 363, 2016, pp. 8-23, doi:10.1016/j.ins.2016.05.003.
 */
class CGiniDisjointSets : public CCountDisjointSets
{
protected:
    CIntDict<Py_ssize_t> number_of_size;  /*!< number_of_size[i] gives the number
                                           * of subsets of size i (there are at most
                                           * sqrt(n) possible non-zero elements) */

    double gini;           //!< the Gini index of the current subset sizes

    Py_ssize_t forgotten;  //!< the number of subsets excluded via merge_and_forget()

    /*! Re-computes the normalised Gini index from its definition. */
    void recompute_gini()
    {
        gini = 0.0;
        if (number_of_size.size() > 1) {  // otherwise all clusters are of identical sizes
            GENIECLUST_ASSERT(k-forgotten-1 > 0)
            Py_ssize_t v = number_of_size.get_key_min();
            Py_ssize_t i = 0;
            while (v != number_of_size.get_key_max()) {
                Py_ssize_t w = v;                    // previous v
                v = number_of_size.get_key_next(v);  // next v
                i += number_of_size[w];              // cumulative counts
                gini += ((double)v-w)*i*((double)k-forgotten-i);
            }
            gini /= (double)(n*(k-forgotten-1.0));  // the normalised Gini index
            if (gini > 1.0) gini = 1.0;  // account for round-off errors
            else if (gini < 0.0) gini = 0.0;
        }
    }

    /*!
called by merge(x, y) and merge_and_forget(x, y) */ Py_ssize_t merge(Py_ssize_t x, Py_ssize_t y, bool forget) { x = this->find(x); // includes a range check for x y = this->find(y); // includes a range check for y if (x == y) throw std::invalid_argument("find(x) == find(y)"); if (y < x) std::swap(x, y); // DisjointSet's merge part: this->par[y] = x; // update the parent of y this->k -= 1; // decrease the subset count // CCountDisjointSets's merge part: Py_ssize_t size1 = this->cnt[x]; Py_ssize_t size2 = this->cnt[y]; this->cnt[x] += this->cnt[y]; // cluster x has more elements now this->cnt[y] = 0; // cluster y, well, cleaning up //GENIECLUST_ASSERT(number_of_size.at(size1)>0); number_of_size[size1] -= 1; // one cluster of size1 is no more //GENIECLUST_ASSERT(number_of_size.at(size2)>0); number_of_size[size2] -= 1; // one cluster of size2 is an ex-cluster // get rid of size1 and size2, if necessary if (size2 < size1) std::swap(size1, size2); if (number_of_size.at(size1) <= 0) number_of_size.erase(size1); // fast if (size1 != size2 && number_of_size.at(size2) <= 0) number_of_size.erase(size2); // fast if (!forget) { Py_ssize_t size12 = size1+size2; if (number_of_size.count(size12) == 0) number_of_size[size12] = 1; // might be O(sqrt(n)) else number_of_size[size12] += 1; // long live cluster of size1+2 } recompute_gini(); // all done return x; } public: /*! Starts with a "weak" partition { {0}, {1}, ..., {n-1} }, * i.e., n singletons. * * @param n number of elements, n>=0. */ CGiniDisjointSets(Py_ssize_t n) : CCountDisjointSets(n), number_of_size(n+1), forgotten(0) { if (n>0) number_of_size[1] = n; // there are n clusters of size 1 gini = 0.0; // a perfectly balanced cluster size distribution } /*! A nullary constructor allows Cython to allocate * the instances on the stack. Do not use otherwise. */ CGiniDisjointSets() : CGiniDisjointSets(0) { } /*! Returns the Gini index of the distribution of subsets' sizes. 
 *
 * Run time: O(1), as the Gini index is updated during a call
 * to merge().
 */
double get_gini() const { return this->gini; }

/*! Returns the size of the smallest subset.
 *
 * Run time: O(1).
 */
Py_ssize_t get_smallest_count() const {
    return number_of_size.get_key_min();  /*this->tab_head;*/
}

/*! Returns the size of the largest subset.
 *
 * Run time: O(1).
 */
Py_ssize_t get_largest_count() const {
    return number_of_size.get_key_max();  /*this->tab_tail;*/
}

/*! Returns the number of sets of a given size
 *
 * Run time: O(1).
 */
Py_ssize_t get_k_of_size(Py_ssize_t c) { return number_of_size[c]; }

/*! Determines the Gini index that we would get if x and y were merged.
 */
double test_gini_after_merge(Py_ssize_t x, Py_ssize_t y, bool forget)
{
    x = this->find(x);  // includes a range check for x
    y = this->find(y);  // includes a range check for y
    Py_ssize_t size1 = this->cnt[x];
    Py_ssize_t size2 = this->cnt[y];
    Py_ssize_t size12 = size1+size2;
    if (!(size1 <= size2)) std::swap(size1, size2);

    double new_gini = gini*(double)n*(k-forgotten-1.0);
    Py_ssize_t v = number_of_size.get_key_min();
    while (true) {
        Py_ssize_t vc = number_of_size[v];
        new_gini -= vc*std::fabs(v-size1);
        new_gini -= vc*std::fabs(v-size2);
        if (!forget) new_gini += vc*std::fabs(v-size12);
        if (v == number_of_size.get_key_max()) break;
        v = number_of_size.get_key_next(v);  // next v
    }
    new_gini += std::fabs(size2-size1);
    if (!forget) {
        new_gini -= std::fabs(size2-size12);
        new_gini -= std::fabs(size1-size12);
    }

    // note: this is a what-if query -- do not modify this->forgotten here
    Py_ssize_t new_forgotten = forgotten + (forget ? 1 : 0);
    new_gini /= (double)n*(double)(k-1-new_forgotten-1.0);
    new_gini = std::min(1.0, std::max(0.0, new_gini));  // avoid numeric inaccuracies
    return new_gini;
}

/*! Merges the sets containing x and y in {0,...,n-1}.
 *
 * Let px be the parent id of x and py be the parent id of y.
 * If px < py, then the new parent id of py will be set to px.
 * Otherwise, px will have py as its parent.
 *
 * If x and y are members of the same subset,
 * an exception is thrown.
* * @return the id of the parent of x or y, whichever is smaller. * * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} * * Update time: worst-case amortised O(sqrt(n)). */ virtual Py_ssize_t merge(Py_ssize_t x, Py_ssize_t y) { return merge(x, y, /*forget=*/false); } /*! Merges the sets containing x and y in {0,...,n-1} * * The new cluster will not be taken into account when * computing the Gini index and calling get_smallest_count(). * * * @return the id of the parent of x or y, whichever is smaller. * * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} * * Update time: worst-case amortised O(sqrt(n)). */ Py_ssize_t merge_and_forget(Py_ssize_t x, Py_ssize_t y) { ++forgotten; return merge(x, y, /*forget=*/true); } /*! Generates an array of subsets' sizes. * The resulting vector is ordered nondecreasingly. * * Run time: O(k), where k is the current number of subsets. * * This is only valid if merge_and_forget has not been used. * * @param res [out] c_contiguous array of length k */ void get_counts(Py_ssize_t* res) { GENIECLUST_ASSERT(forgotten == 0) Py_ssize_t i = 0; for (CIntDict::iterator it = number_of_size.begin(); it != number_of_size.end(); ++it) { // add this->tab[v] times v for (Py_ssize_t j=0; j * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_postprocess_h #define __c_postprocess_h #include "c_common.h" #include /*! 
Merge all "boundary" noise points with their nearest "core" points
 *
 * For each boundary point i, set c[i] = c[j],
 * where {i,j} is an edge in a spanning forest given by the adjacency matrix ind.
 *
 * The i-th point is a boundary point if it is a noise point, i.e., c[i] < 0,
 * and it is amongst j's M-1 nearest neighbours.
 *
 * @param ind c_contiguous matrix of size num_edges*2,
 *     where {ind[i,0], ind[i,1]} specifies the i-th (undirected) edge
 *     in a spanning tree or forest; ind[i,j] < n.
 *     Edges with ind[i,0] < 0 or ind[i,1] < 0 are purposely ignored.
 * @param num_edges number of rows in ind (edges)
 * @param nn c_contiguous matrix of size n*num_neighbours;
 *     nn[i,:] gives the indices of the i-th point's
 *     nearest neighbours; -1 indicates a "missing value"
 * @param num_neighbours number of columns in nn
 * @param M smoothing factor, 2 <= M < num_neighbours
 * @param c [in/out] c_contiguous vector of length n, where
 *     c[i] denotes the cluster id
 *     (in {-1, 0, 1, ..., k-1} for some k) of the i-th object, i=0,...,n-1.
 *     Class -1 denotes the `noise' cluster.
* @param n length of c and the number of vertices in the spanning forest */ void Cmerge_boundary_points( const Py_ssize_t* ind, Py_ssize_t num_edges, const Py_ssize_t* nn, Py_ssize_t num_neighbours, Py_ssize_t M, Py_ssize_t* c, Py_ssize_t n) { if (M < 2 || M-2 >= num_neighbours) throw std::domain_error("Incorrect smoothing factor M"); for (Py_ssize_t i=0; i ignore if (u>=n || v>=n) throw std::domain_error("All elements must be <= n"); if (c[u] < 0 && c[v] < 0) throw std::domain_error("Edge between two unallocated points detected"); if (c[u] >= 0 && c[v] >= 0) continue; if (c[v] < 0) std::swap(u, v); GENIECLUST_ASSERT(c[u] < 0); // u is marked as a noise point GENIECLUST_ASSERT(c[v] >= 0); // v is a core point // a noise point is not necessarily a boundary point: // u is a boundary point if u is amongst v's M-1 nearest neighbours //c[u] = -1; // it's negative anyway for (Py_ssize_t j=0; j ignore if (u>=n || v>=n) throw std::domain_error("All elements must be <= n"); if (c[u] < 0 && c[v] < 0) throw std::domain_error("An edge between two unallocated points detected"); if (c[u] < 0) c[u] = c[v]; else if (c[v] < 0) c[v] = c[u]; //else // continue; } } #endif genieclust/src/cvi_generalized_dunn_delta.h0000644000176200001440000001024614634041051020704 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. 
* You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_GENERALIZED_DUNN_DELTA_H #define __CVI_GENERALIZED_DUNN_DELTA_H #include "cvi.h" class Delta { protected: EuclideanDistance& D; ///< squared Euclidean const CMatrix& X; //CMatrix& X; ///< data matrix of size n*d std::vector& L; ///< current label vector of size n std::vector& count; ///< size of each of the K clusters size_t K; size_t n; size_t d; CMatrix* centroids; ///< centroids, can be NULL public: Delta( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : D(D), X(X), L(L), count(count), K(K), n(n), d(d), centroids(centroids) { } virtual void before_modify(size_t i, Py_ssize_t j) = 0; virtual void after_modify(size_t i, Py_ssize_t j) = 0; virtual void undo() = 0; virtual void recompute_all() = 0; virtual ~Delta() {} }; class LowercaseDelta : public Delta { public: LowercaseDelta( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : Delta(D,X,L,count,K,n,d,centroids) { } virtual FLOAT_T compute(size_t k, size_t l) = 0; virtual ~LowercaseDelta() {} }; class UppercaseDelta : public Delta { public: UppercaseDelta( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : Delta(D,X,L,count,K,n,d,centroids) { } virtual FLOAT_T compute(size_t k) = 0; virtual ~UppercaseDelta() {} }; class DeltaFactory { public: virtual bool IsCentroidNeeded() = 0; virtual ~DeltaFactory() {} }; class LowercaseDeltaFactory : public DeltaFactory { public: // cannot be in DeltaFactory since result type is different, even if parameter list is the same virtual LowercaseDelta* create(EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) = 0; // 
static LowercaseDeltaFactory* GetSpecializedFactory(std::string lowercaseDeltaName); }; class UppercaseDeltaFactory : public DeltaFactory { public: // cannot be in DeltaFactory since result type is different, even if parameter list is the same virtual UppercaseDelta* create(EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) = 0; // static UppercaseDeltaFactory* GetSpecializedFactory(std::string uppercaseDeltaName); }; #endif genieclust/src/cvi_generalized_dunn_lowercase_d6.h0000644000176200001440000001555714634041051022202 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * Copyleft (C) 2020, Maciej Bartoszuk * * For the 'genieclust' version: * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_GENERALIZED_DUNN_LOWERCASE_D6_H #define __CVI_GENERALIZED_DUNN_LOWERCASE_D6_H #include "cvi.h" #include "cvi_generalized_dunn_delta.h" class LowercaseDelta6 : public LowercaseDelta { protected: CMatrix dist; /**< intra-cluster distances: dist(i,j) = min( X(u,), X(v,) ), X(u,) in C_i, X(v,) in C_j (i!=j) */ CMatrix last_dist; ///< for undo() std::vector min_dists; ///< helper for calculating minimum distances to clusters for a single point bool last_chg; ///< for undo() (was dist changed at all?) 
bool needs_recompute; ///< for before and after modify Py_ssize_t cluster1; Py_ssize_t cluster2; public: LowercaseDelta6( EuclideanDistance& D, const CMatrix& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr ) : LowercaseDelta(D, X, L, count,K,n,d,centroids), dist(K, K), last_dist(K, K), min_dists(K) { } virtual void before_modify(size_t i, Py_ssize_t j) { needs_recompute = false; for (size_t u=0; u& X, std::vector& L, std::vector& count, size_t K, size_t n, size_t d, CMatrix* centroids=nullptr) { return new LowercaseDelta6(D, X, L, count, K, n, d, centroids); } }; #endif genieclust/src/cvi_wcss.h0000644000176200001440000000510214634041051015170 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * * Copyleft (C) 2020-2024, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __CVI_WCSS_H #define __CVI_WCSS_H #include "cvi.h" /** Negated Within-Cluster Sum of Squares and the Ball-Hall Index * * The Ball-Hall index is weighted by the cluster cardinality (weighted=true). * * WCSS is the objective function used, amongst others, in the k-means and * the Ward and Calinski&Harabasz algorithms. * * G.H. Ball, D.J. Hall, * ISODATA: A novel method of data analysis and pattern classification, * Technical report No. AD699616, Stanford Research Institute, 1965. * * T. Calinski, J. 
Harabasz, A dendrite method for cluster analysis, * Communications in Statistics, 3(1), 1974, pp. 1-27, * doi:10.1080/03610927408827101. * * * See the following paper for the formula and further discussion: * M. Gagolewski, M. Bartoszuk, A. Cena, * Are cluster validity measures (in)valid?, Information Sciences 581, * 620-636, 2021, DOI:10.1016/j.ins.2021.10.004 */ class WCSSIndex : public CentroidsBasedIndex { protected: bool weighted; ///< false for WCSS, true for the Ball-Hall index public: // Described in the base class WCSSIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo=false, bool _weighted=false) : CentroidsBasedIndex(_X, _K, _allow_undo) { weighted = _weighted; } // Described in the base class virtual FLOAT_T compute() { // sum of within-cluster squared L2 distances FLOAT_T wcss = 0.0; for (size_t i=0; i * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #include "c_compare_partitions.h" #include #include #include using namespace Rcpp; /** Extract or compute the contingency matrix based on given arguments * * @param x vector or contingency table (matrix) * @param y R_NilValue or vector of size x.size() if x is not a matrix * @param xc [out] * @param yc [out] * * @return flat, contiguous c_style vector representing the contingency table * with xc rows and yc columns */ std::vector get_contingency_matrix( RObject x, RObject y, Py_ssize_t* xc, Py_ssize_t* yc ) { if (Rf_isMatrix(x)) { if (!Rf_isNull(y)) stop("if x is a contingency matrix, y must be NULL"); if (!(Rf_isInteger(x) | Rf_isReal(x))) stop("x must be of type numeric"); NumericMatrix X(x); *xc = X.nrow(); *yc = X.ncol(); std::vector C((*xc)*(*yc)); Py_ssize_t k=0; for (Py_ssize_t i=0; i<*xc; ++i) for (Py_ssize_t j=0; j<*yc; ++j) C[k++] = X(i, j); // Fortran -> C-style return C; } else { if (Rf_isNull(y)) stop("if x is not a contingency matrix, y must not be NULL"); if (!(Rf_isInteger(x) | Rf_isReal(x) | Rf_isLogical(x) | Rf_isFactor(x))) stop("x must be of type numeric"); if (!(Rf_isInteger(x) | Rf_isReal(x) | Rf_isLogical(x) | Rf_isFactor(x))) stop("y must be of type numeric"); IntegerVector rx(x); IntegerVector ry(y); Py_ssize_t n = rx.size(); if (ry.size() != n) stop("x and y must be of equal lengths"); for (Py_ssize_t i=0; i C((*xc)*(*yc)); Ccontingency_table(C.data(), *xc, *yc, xmin, ymin, INTEGER(SEXP(rx)), INTEGER(SEXP(ry)), n); return C; } } //' @title External Cluster Validity Measures and Pairwise Partition Similarity Scores //' //' @description //' The functions described in this section quantify the similarity between //' two label vectors \code{x} and \code{y} which represent two partitions //' of a set of \eqn{n} elements into, respectively, \eqn{K} and \eqn{L} //' nonempty and pairwise disjoint subsets. 
//' //' For instance, \code{x} and \code{y} can represent two clusterings //' of a dataset with \eqn{n} observations specified by two vectors //' of labels. The functions described here can be used as external cluster //' validity measures, where we assume that \code{x} is //' a reference (ground-truth) partition whilst \code{y} is the vector //' of predicted cluster memberships. //' //' All indices except \code{normalized_clustering_accuracy()} //' can act as a pairwise partition similarity score: they are symmetric, //' i.e., \code{index(x, y) == index(y, x)}. //' //' Each index except \code{mi_score()} (which computes the mutual //' information score) outputs 1 given two identical partitions. //' Note that partitions are always defined up to a permutation (bijection) //' of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) //' represent the same 2-partition. //' //' @details //' \code{normalized_clustering_accuracy()} (Gagolewski, 2023) //' is an asymmetric external cluster validity measure //' which assumes that the label vector \code{x} (or rows in the confusion //' matrix) represents the reference (ground truth) partition. //' It is an average proportion of correctly classified points in each cluster //' above the worst case scenario of uniform membership assignment, //' with cluster ID matching based on the solution to the maximal linear //' sum assignment problem; see \code{\link{normalized_confusion_matrix}}). //' It is given by: //' \eqn{\max_\sigma \frac{1}{K} \sum_{j=1}^K \frac{c_{\sigma(j), j}-c_{\sigma(j),\cdot}/K}{c_{\sigma(j),\cdot}-c_{\sigma(j),\cdot}/K}}, //' where \eqn{C} is a confusion matrix with \eqn{K} rows and \eqn{L} columns, //' \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}}, and //' \eqn{c_{i, \cdot}=c_{i, 1}+...+c_{i, L}} is the i-th row sum, //' under the assumption that \eqn{c_{i,j}=0} for \eqn{i>K} or \eqn{j>L} //' and \eqn{0/0=0}. 
//' //' \code{normalized_pivoted_accuracy()} is defined as //' \eqn{(\max_\sigma \sum_{j=1}^{\max(K,L)} c_{\sigma(j),j}/n-1/\max(K,L))/(1-1/\max(K,L))}, //' where \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}}, //' and \eqn{n} is the sum of all elements in \eqn{C}. //' For non-square matrices, missing rows/columns are assumed //' to be filled with 0s. //' //' \code{pair_sets_index()} (PSI) was introduced in (Rezaei, Franti, 2016). //' The simplified PSI assumes E=1 in the definition of the index, //' i.e., uses Eq. (20) in the said paper instead of Eq. (18). //' For non-square matrices, missing rows/columns are assumed //' to be filled with 0s. //' //' \code{rand_score()} gives the Rand score (the "probability" of agreement //' between the two partitions) and //' \code{adjusted_rand_score()} is its version corrected for chance, //' see (Hubert, Arabie, 1985): its expected value is 0 given two independent //' partitions. Due to the adjustment, the resulting index may be negative //' for some inputs. //' //' Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score //' and \code{adjusted_fm_score()} is its adjusted-for-chance version; //' see (Hubert, Arabie, 1985). //' //' \code{mi_score()}, \code{adjusted_mi_score()} and //' \code{normalized_mi_score()} are information-theoretic //' scores, based on mutual information, //' see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}} //' in (Vinh et al., 2010). //' //' //' \code{normalized_confusion_matrix()} computes the confusion matrix //' and permutes its rows and columns so that the sum of the elements //' of the main diagonal is the largest possible (by solving //' the maximal assignment problem). //' The function only accepts \eqn{K \leq L}. //' The reordering of the columns of a confusion matrix can be determined //' by calling \code{normalizing_permutation()}. //' //' Also note that the built-in //' \code{\link{table}()} determines the standard confusion matrix. 
//'
//' @references
//' Gagolewski M., A framework for benchmarking clustering algorithms,
//' \emph{SoftwareX} 20, 2022, 101270,
//' \doi{10.1016/j.softx.2022.101270},
//' \url{https://clustering-benchmarks.gagolewski.com}.
//'
//' Gagolewski M., Normalised clustering accuracy: An asymmetric external
//' cluster validity measure, \emph{Journal of Classification}, 2024, in press,
//' \doi{10.1007/s00357-024-09482-2}.
//'
//' Hubert L., Arabie P., Comparing partitions,
//' \emph{Journal of Classification} 2(1), 1985, pp. 193-218, esp. Eqs. (2) and (4).
//'
//' Meila M., Heckerman D., An experimental comparison of model-based clustering
//' methods, \emph{Machine Learning} 42, 2001, pp. 9-29,
//' \doi{10.1023/A:1007648401407}.
//'
//' Rezaei M., Franti P., Set matching measures for external cluster validity,
//' \emph{IEEE Transactions on Knowledge and Data Engineering} 28(8), 2016,
//' pp. 2173-2186.
//'
//' Steinley D., Properties of the Hubert-Arabie adjusted Rand index,
//' \emph{Psychological Methods} 9(3), 2004, pp. 386-396,
//' \doi{10.1037/1082-989X.9.3.386}.
//'
//' Vinh N.X., Epps J., Bailey J.,
//' Information theoretic measures for clusterings comparison:
//' Variants, properties, normalization and correction for chance,
//' \emph{Journal of Machine Learning Research} 11, 2010, pp. 2837-2854.
//'
//'
//' @param x an integer vector of length n (or an object coercible to)
//' representing a K-partition of an n-set (e.g., a reference partition),
//' or a confusion matrix with K rows and L columns
//' (see \code{\link{table}(x, y)})
//'
//' @param y an integer vector of length n (or an object coercible to)
//' representing an L-partition of the same set (e.g., the output of a
//' clustering algorithm we wish to compare with \code{x}),
//' or NULL (if x is a K*L confusion matrix)
//'
//' @param simplified whether to assume E=1 in the definition of the pair sets
//' index, i.e., use Eq. (20) in (Rezaei, Franti, 2016) instead of Eq.
(18) //' //' @param clipped whether the result should be clipped to the unit interval, i.e., [0, 1] //' //' //' @return Each cluster validity measure is a single numeric value. //' //' \code{normalized_confusion_matrix()} returns a numeric matrix. //' //' \code{normalizing_permutation()} returns a vector of indexes. //' //' //' @examples //' y_true <- iris[[5]] //' y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster //' normalized_clustering_accuracy(y_true, y_pred) //' normalized_pivoted_accuracy(y_true, y_pred) //' pair_sets_index(y_true, y_pred) //' pair_sets_index(y_true, y_pred, simplified=TRUE) //' adjusted_rand_score(y_true, y_pred) //' rand_score(table(y_true, y_pred)) # the same //' adjusted_fm_score(y_true, y_pred) //' fm_score(y_true, y_pred) //' mi_score(y_true, y_pred) //' normalized_mi_score(y_true, y_pred) //' adjusted_mi_score(y_true, y_pred) //' normalized_confusion_matrix(y_true, y_pred) //' normalizing_permutation(y_true, y_pred) //' //' @rdname compare_partitions //' @name compare_partitions //' @export //[[Rcpp::export]] double normalized_clustering_accuracy(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_nca(C.data(), xc, yc); } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double normalized_pivoted_accuracy(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_npa(C.data(), xc, yc); } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double pair_sets_index(RObject x, RObject y=R_NilValue, bool simplified=false, bool clipped=true) { Py_ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); double res; if (simplified) res = Ccompare_partitions_psi(C.data(), xc, yc).spsi_unclipped; else res = Ccompare_partitions_psi(C.data(), xc, yc).psi_unclipped; // Rezaei&Franti use clipped=true in their paper if (clipped) res = std::max(0.0, 
std::min(1.0, res)); return res; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double adjusted_rand_score(RObject x, RObject y=R_NilValue, bool clipped=false) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); double res = Ccompare_partitions_pairs(C.data(), xc, yc).ar; if (clipped) res = std::max(0.0, std::min(1.0, res)); return res; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double rand_score(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_pairs(C.data(), xc, yc).r; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double adjusted_fm_score(RObject x, RObject y=R_NilValue, bool clipped=false) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); double res = Ccompare_partitions_pairs(C.data(), xc, yc).afm; if (clipped) res = std::max(0.0, std::min(1.0, res)); return res; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double fm_score(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_pairs(C.data(), xc, yc).fm; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double mi_score(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_info(C.data(), xc, yc).mi; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double normalized_mi_score(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_info(C.data(), xc, yc).nmi; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] double adjusted_mi_score(RObject x, RObject y=R_NilValue, bool clipped=false) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); double res = Ccompare_partitions_info(C.data(), xc,
yc).ami; if (clipped) res = std::max(0.0, std::min(1.0, res)); return res; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] NumericMatrix normalized_confusion_matrix(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); std::vector<double> C_out_Corder(xc*yc); Capply_pivoting(C.data(), xc, yc, C_out_Corder.data()); NumericMatrix Cout(xc, yc); for (Py_ssize_t i=0; i<xc; ++i) for (Py_ssize_t j=0; j<yc; ++j) Cout(i, j) = C_out_Corder[i*yc+j]; return Cout; } //' @rdname compare_partitions //' @export //[[Rcpp::export]] IntegerVector normalizing_permutation(RObject x, RObject y=R_NilValue) { Py_ssize_t xc, yc; std::vector<double> C( get_contingency_matrix(x, y, &xc, &yc) ); IntegerVector Iout(yc); Cnormalizing_permutation(C.data(), xc, yc, INTEGER(SEXP(Iout))); for (Py_ssize_t j=0; j<yc; ++j) Iout[j]++; // 0-based -> 1-based return Iout; } genieclust/src/r_cvi.cpp0000644000176200001440000003160414650456346015032 0ustar liggesusers/* Rcpp exports - Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * * Copyleft (C) 2020-2024, Marek Gagolewski <https://www.gagolewski.com/> * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to <https://www.gnu.org/licenses/>.
*/ #include #include #include #include #include "c_common.h" #include "c_matrix.h" #include "cvi.h" #include "cvi_calinski_harabasz.h" #include "cvi_davies_bouldin.h" #include "cvi_silhouette.h" // #include "cvi_dunn.h" // #include "cvi_gamma.h" #include "cvi_wcss.h" #include "cvi_wcnn.h" #include "cvi_dunnowa.h" #include "cvi_generalized_dunn.h" #include "cvi_generalized_dunn_lowercase_d1.h" #include "cvi_generalized_dunn_lowercase_d2.h" #include "cvi_generalized_dunn_lowercase_d3.h" #include "cvi_generalized_dunn_lowercase_d4.h" #include "cvi_generalized_dunn_lowercase_d5.h" #include "cvi_generalized_dunn_lowercase_d6.h" #include "cvi_generalized_dunn_uppercase_d1.h" #include "cvi_generalized_dunn_uppercase_d2.h" #include "cvi_generalized_dunn_uppercase_d3.h" using namespace Rcpp; /** Converts a 1-based label vector to a 0-based vector of small integers. * * @param x numeric vector with integer elements * @param K [out] the number of clusters * @return vector */ std::vector translateLabels_fromR(const Rcpp::NumericVector& x, Py_ssize_t& K) { size_t n = x.size(); std::vector ret(n); K = 0; for (size_t i=0; i= 1."); ret[i] = (Py_ssize_t)(xi-1); // 1-based -> 0-based if (K < xi) K = xi; // determine the max(x) } return ret; } //' @title Internal Cluster Validity Measures //' //' @description //' Implementation of a number of so-called cluster validity indices critically //' reviewed in (Gagolewski, Bartoszuk, Cena, 2021). See Section 2 //' therein and (Gagolewski, 2022) for the respective definitions. //' //' The greater the index value, the more \emph{valid} (whatever that means) //' the assessed partition. For consistency, the Ball-Hall and //' Davies-Bouldin indexes as well as the within-cluster sum of squares (WCSS) //' take negative values. 
//' //' //' @param X numeric matrix with \code{n} rows and \code{d} columns, //' representing \code{n} points in a \code{d}-dimensional space //' //' @param y vector of \code{n} integer labels, //' representing a partition whose \emph{quality} is to be //' assessed; \code{y[i]} is the cluster ID of the \code{i}-th point, //' \code{X[i, ]}; \code{1 <= y[i] <= K}, where \code{K} is the number //' of clusters //' //' @param M number of nearest neighbours //' //' @param lowercase_d an integer between 1 and 5, denoting //' \eqn{d_1}, ..., \eqn{d_5} in the definition //' of the generalised Dunn (Bezdek-Pal) index (numerator: //' min, max, and mean pairwise intracluster distance, //' distance between cluster centroids, //' weighted point-centroid distance, respectively) //' //' @param uppercase_d an integer between 1 and 3, denoting //' \eqn{D_1}, ..., \eqn{D_3} in the definition //' of the generalised Dunn (Bezdek-Pal) index (denominator: //' max and min pairwise intracluster distance, average point-centroid //' distance, respectively) //' //' @param owa_numerator,owa_denominator single string specifying //' the OWA operators to use in the definition of the DuNN index; //' one of: \code{"Mean"}, \code{"Min"}, \code{"Max"}, \code{"Const"}, //' \code{"SMin:D"}, \code{"SMax:D"}, where \code{D} is an integer //' defining the degree of smoothness //' //' //' @return //' A single numeric value (the more, the \emph{better}). //' //' @references //' Ball G.H., Hall D.J., //' \emph{ISODATA: A novel method of data analysis and pattern classification}, //' Technical report No. AD699616, Stanford Research Institute, 1965. //' //' Bezdek J., Pal N., Some new indexes of cluster validity, //' \emph{IEEE Transactions on Systems, Man, and Cybernetics, Part B} 28, //' 1998, 301-315, \doi{10.1109/3477.678624}. //' //' Calinski T., Harabasz J., A dendrite method for cluster analysis, //' \emph{Communications in Statistics} 3(1), 1974, 1-27, //' \doi{10.1080/03610927408827101}.
//' //' Davies D.L., Bouldin D.W., //' A Cluster Separation Measure, //' \emph{IEEE Transactions on Pattern Analysis and Machine Intelligence} //' PAMI-1 (2), 1979, 224-227, \doi{10.1109/TPAMI.1979.4766909}. //' //' Dunn J.C., A Fuzzy Relative of the ISODATA Process and Its Use in Detecting //' Compact Well-Separated Clusters, \emph{Journal of Cybernetics} 3(3), 1973, //' 32-57, \doi{10.1080/01969727308546046}. //' //' Gagolewski M., Bartoszuk M., Cena A., //' Are cluster validity measures (in)valid?, \emph{Information Sciences} 581, //' 620-636, 2021, \doi{10.1016/j.ins.2021.10.004}; //' preprint: \url{https://raw.githubusercontent.com/gagolews/bibliography/master/preprints/2021cvi.pdf}. //' //' Gagolewski M., A Framework for Benchmarking Clustering Algorithms, //' \emph{SoftwareX} 20, 2022, 101270, //' \doi{10.1016/j.softx.2022.101270}, //' \url{https://clustering-benchmarks.gagolewski.com}. //' //' Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and //' Validation of Cluster Analysis, \emph{Journal of Computational and Applied Mathematics} //' 20, 1987, 53-65, \doi{10.1016/0377-0427(87)90125-7}.
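The Calinski-Harabasz index documented here is the classical variance-ratio criterion, CH = (B/(K-1)) / (W/(n-K)), where W is the within-cluster and B the between-cluster sum of squares. The following plain-Python sketch of that formula is illustrative only: the helper name is made up, and the package computes the index natively in C++ (via its CalinskiHarabaszIndex class), not like this.

```python
def calinski_harabasz(X, y):
    # Variance-ratio criterion (Calinski & Harabasz, 1974):
    # CH = (B / (K-1)) / (W / (n-K)), with 1-based cluster labels in y.
    n, d = len(X), len(X[0])
    K = max(y)
    overall = [sum(p[j] for p in X) / n for j in range(d)]
    W = 0.0  # within-cluster sum of squared distances to cluster centroids
    B = 0.0  # between-cluster: size-weighted centroid-to-overall-mean squares
    for k in range(1, K + 1):
        pts = [p for p, lab in zip(X, y) if lab == k]
        c = [sum(p[j] for p in pts) / len(pts) for j in range(d)]
        W += sum((p[j] - c[j]) ** 2 for p in pts for j in range(d))
        B += len(pts) * sum((c[j] - overall[j]) ** 2 for j in range(d))
    return (B / (K - 1)) / (W / (n - K))
```

On a toy one-dimensional data set with two tight, well-separated groups the ratio is large; scrambling the labels drives it down, mirroring the "good"/"bad" pair in the examples below.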
//' //' //' //' @examples //' X <- as.matrix(iris[,1:4]) //' X[,] <- jitter(X) # otherwise we get a non-unique solution //' y <- as.integer(iris[[5]]) //' calinski_harabasz_index(X, y) # good //' calinski_harabasz_index(X, sample(1:3, nrow(X), replace=TRUE)) # bad //' //' @name cluster_validity //' @rdname cluster_validity //' @export // [[Rcpp::export]] double calinski_harabasz_index(NumericMatrix X, NumericVector y) { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); CalinskiHarabaszIndex ind(_X, (Py_ssize_t)K); ind.set_labels(_y); return (double)ind.compute(); } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double dunnowa_index(NumericMatrix X, NumericVector y, int M=25, Rcpp::String owa_numerator="SMin:5", Rcpp::String owa_denominator="Const") { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); if (M <= 0) // M = min(n-1, M) in the constructor Rf_error("M must be positive."); int _owa_numerator = DuNNOWA_get_OWA(std::string(owa_numerator)); int _owa_denominator = DuNNOWA_get_OWA(std::string(owa_denominator)); if (_owa_numerator == OWA_ERROR || _owa_denominator == OWA_ERROR) { Rf_error("invalid OWA operator specifier"); } DuNNOWAIndex ind(_X, (Py_ssize_t)K, false, M, _owa_numerator, _owa_denominator); ind.set_labels(_y); return (double)ind.compute(); } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double generalised_dunn_index(NumericMatrix X, NumericVector y, int lowercase_d, int uppercase_d) { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); LowercaseDeltaFactory* lowercase_deltaFactory;
UppercaseDeltaFactory* uppercase_deltaFactory; if (lowercase_d == 1) { lowercase_deltaFactory = new LowercaseDelta1Factory(); } else if (lowercase_d == 2) { lowercase_deltaFactory = new LowercaseDelta2Factory(); } else if (lowercase_d == 3) { lowercase_deltaFactory = new LowercaseDelta3Factory(); } else if (lowercase_d == 4) { lowercase_deltaFactory = new LowercaseDelta4Factory(); } else if (lowercase_d == 5) { lowercase_deltaFactory = new LowercaseDelta5Factory(); } else if (lowercase_d == 6) { lowercase_deltaFactory = new LowercaseDelta6Factory(); } else { Rf_error("invalid lowercase_d"); } if (uppercase_d == 1) { uppercase_deltaFactory = new UppercaseDelta1Factory(); } else if (uppercase_d == 2) { uppercase_deltaFactory = new UppercaseDelta2Factory(); } else if (uppercase_d == 3) { uppercase_deltaFactory = new UppercaseDelta3Factory(); } else { Rf_error("invalid uppercase_d"); } bool areCentroidsNeeded = ( lowercase_deltaFactory->IsCentroidNeeded() || uppercase_deltaFactory->IsCentroidNeeded() ); if (areCentroidsNeeded) { GeneralizedDunnIndexCentroidBased ind(_X, (Py_ssize_t)K, lowercase_deltaFactory, uppercase_deltaFactory); delete lowercase_deltaFactory; delete uppercase_deltaFactory; ind.set_labels(_y); return (double)ind.compute(); } else { GeneralizedDunnIndex ind(_X, (Py_ssize_t)K, lowercase_deltaFactory, uppercase_deltaFactory); delete lowercase_deltaFactory; delete uppercase_deltaFactory; ind.set_labels(_y); return (double)ind.compute(); } } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double negated_ball_hall_index(NumericMatrix X, NumericVector y) { Py_ssize_t K; std::vector _y = translateLabels_fromR(y, /*out*/K); CMatrix _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); WCSSIndex ind(_X, (Py_ssize_t)K, false, true/*weighted*/); ind.set_labels(_y); return (double)ind.compute(); } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double 
negated_davies_bouldin_index(NumericMatrix X, NumericVector y) { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); DaviesBouldinIndex ind(_X, (Py_ssize_t)K); ind.set_labels(_y); return (double)ind.compute(); } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double negated_wcss_index(NumericMatrix X, NumericVector y) { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); WCSSIndex ind(_X, (Py_ssize_t)K, false, false/*not weighted*/); ind.set_labels(_y); return (double)ind.compute(); } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double silhouette_index(NumericMatrix X, NumericVector y) { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); SilhouetteIndex ind(_X, (Py_ssize_t)K, false, false); ind.set_labels(_y); return (double)ind.compute(); } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double silhouette_w_index(NumericMatrix X, NumericVector y) { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); SilhouetteIndex ind(_X, (Py_ssize_t)K, false, true); ind.set_labels(_y); return (double)ind.compute(); } //' @rdname cluster_validity //' @export // [[Rcpp::export]] double wcnn_index(NumericMatrix X, NumericVector y, int M=25) { Py_ssize_t K; std::vector<Py_ssize_t> _y = translateLabels_fromR(y, /*out*/K); CMatrix<FLOAT_T> _X(REAL(SEXP(X)), X.nrow(), X.ncol(), false); if (_X.nrow() < 1 || _X.nrow() != _y.size()) Rf_error("Incompatible X and y"); if (M <= 0) // M =
min(n-1, M) in the constructor Rf_error("M must be positive."); WCNNIndex ind(_X, (Py_ssize_t)K, false, M); ind.set_labels(_y); return (double)ind.compute(); } genieclust/src/cvi_davies_bouldin.h0000644000176200001440000000762214634041051017211 0ustar liggesusers/* Internal cluster validity indices * * Code originally contributed in , * see https://doi.org/10.1016/j.ins.2021.10.004. * * Copyleft (C) 2020-2024, Marek Gagolewski <https://www.gagolewski.com/> * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to <https://www.gnu.org/licenses/>. */ #ifndef __CVI_DAVIES_BOULDIN_H #define __CVI_DAVIES_BOULDIN_H #include "cvi.h" /** The Negated Davies-Bouldin Index * * See Def. 5 in (Davies, Bouldin, 1979). Singletons are assumed * to have infinite dispersion (see discussion on p. 225 therein). * * D.L. Davies, D.W. Bouldin, * A cluster separation measure, * IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1 (2), * 1979, pp. 224-227, doi:10.1109/TPAMI.1979.4766909 * * See the following paper for the formula and further discussion: * M. Gagolewski, M. Bartoszuk, A.
Cena, * Are cluster validity measures (in)valid?, Information Sciences 581, * 620-636, 2021, DOI:10.1016/j.ins.2021.10.004 */ class DaviesBouldinIndex : public CentroidsBasedIndex { protected: std::vector R; ///< average distance between ///< cluster centroids and their members public: // Described in the base class DaviesBouldinIndex( const CMatrix& _X, const size_t _K, const bool _allow_undo=false) : CentroidsBasedIndex(_X, _K, _allow_undo), R(_K) { } // // Described in the base class // virtual void set_labels(const std::vector& _L) // { // CentroidsBasedIndex::set_labels(_L); // sets L, count and centroids // } // // Described in the base class // virtual void modify(size_t i, Py_ssize_t j) // { // // sets L[i]=j and updates count as well as centroids // CentroidsBasedIndex::modify(i, j); // } // // Described in the base class // virtual void undo() { // CentroidsBasedIndex::undo(); // } // Described in the base class virtual FLOAT_T compute() { // Compute the average distances between the cluster centroids // and their members. // The centroids are up-to-date. for (size_t i=0; i max_r) max_r = cur_r; } ret += max_r; } ret = -ret/(FLOAT_T)K; // negative!! 
GENIECLUST_ASSERT(ret < 1e-12); return ret; } }; #endif genieclust/NAMESPACE0000644000176200001440000000212114502231306013623 0ustar liggesusers# Generated by roxygen2: do not edit by hand S3method(gclust,default) S3method(gclust,dist) S3method(gclust,mst) S3method(genie,default) S3method(genie,dist) S3method(genie,mst) S3method(mst,default) S3method(mst,dist) export(adjusted_fm_score) export(adjusted_mi_score) export(adjusted_rand_score) export(bonferroni_index) export(calinski_harabasz_index) export(devergottini_index) export(dunnowa_index) export(emst_mlpack) export(fm_score) export(gclust) export(generalised_dunn_index) export(genie) export(gini_index) export(mi_score) export(mst) export(negated_ball_hall_index) export(negated_davies_bouldin_index) export(negated_wcss_index) export(normalized_clustering_accuracy) export(normalized_confusion_matrix) export(normalized_mi_score) export(normalized_pivoted_accuracy) export(normalizing_permutation) export(pair_sets_index) export(rand_score) export(silhouette_index) export(silhouette_w_index) export(wcnn_index) importFrom(Rcpp,evalCpp) importFrom(stats,cutree) importFrom(stats,dist) importFrom(stats,hclust) importFrom(utils,capture.output) useDynLib(genieclust, .registration=TRUE) genieclust/inst/0000755000176200001440000000000014431101205013360 5ustar liggesusersgenieclust/inst/CITATION0000644000176200001440000000141514431101205014516 0ustar liggesusersbibentry( bibtype = "article", title = "genieclust: Fast and robust hierarchical clustering", author = c(person("Marek", "Gagolewski")), journal = "SoftwareX", year = "2021", volume = "15", pages = "100722", doi = "10.1016/j.softx.2021.100722" ) bibentry( bibtype = "article", title = "Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm", author = c( person("Marek", "Gagolewski"), person("Maciej", "Bartoszuk"), person("Anna", "Cena") ), journal = "Information Sciences", year = "2016", volume = "363", pages = "8--23", doi = 
"10.1016/j.ins.2016.05.003" ) genieclust/man/0000755000176200001440000000000014650456731013202 5ustar liggesusersgenieclust/man/mst.Rd0000644000176200001440000001071214661565331014274 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/mst.R \name{mst} \alias{mst} \alias{mst.default} \alias{mst.dist} \title{Minimum Spanning Tree of the Pairwise Distance Graph} \usage{ mst(d, ...) \method{mst}{default}( d, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), M = 1L, cast_float32 = TRUE, verbose = FALSE, ... ) \method{mst}{dist}(d, M = 1L, verbose = FALSE, ...) } \arguments{ \item{d}{either a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns) or an object of class \code{dist}, see \code{\link[stats]{dist}}} \item{...}{further arguments passed to or from other methods} \item{distance}{metric used to compute the linkage, one of: \code{"euclidean"} (synonym: \code{"l2"}), \code{"manhattan"} (a.k.a. \code{"l1"} and \code{"cityblock"}), \code{"cosine"}} \item{M}{smoothing factor; \code{M} = 1 gives the selected \code{distance}; otherwise, the mutual reachability distance is used} \item{cast_float32}{logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster)} \item{verbose}{logical; whether to print diagnostic messages and progress information} } \value{ Matrix of class \code{mst} with n-1 rows and 3 columns: \code{from}, \code{to} and \code{dist}. It holds \code{from} < \code{to}. Moreover, \code{dist} is sorted nondecreasingly. The i-th row gives the i-th edge of the MST. \code{(from[i], to[i])} defines the vertices (in 1,...,n) and \code{dist[i]} gives the weight, i.e., the distance between the corresponding points. The \code{method} attribute gives the name of the distance used. The \code{Labels} attribute gives the labels of all the input points. 
If \code{M} > 1, the \code{nn} attribute gives the indices of the \code{M}-1 nearest neighbours of each point. } \description{ A parallelised implementation of a Jarnik (Prim/Dijkstra)-like algorithm for determining a(*) minimum spanning tree (MST) of a complete undirected graph representing a set of n points with weights given by a pairwise distance matrix. (*) Note that there might be multiple minimum trees spanning a given graph. } \details{ If \code{d} is a numeric matrix of size \eqn{n \times p}, the \eqn{n (n-1)/2} distances are computed on the fly, so that \eqn{O(n M)} memory is used. The algorithm is parallelised; set the \code{OMP_NUM_THREADS} environment variable (see \code{\link[base]{Sys.setenv}}) to control the number of threads used. Time complexity is \eqn{O(n^2)} for the method accepting an object of class \code{dist} and \eqn{O(p n^2)} otherwise. If \code{M} >= 2, then the mutual reachability distance \eqn{m(i,j)} with smoothing factor \code{M} (see Campello et al. 2013) is used instead of the chosen "raw" distance \eqn{d(i,j)}. It holds \eqn{m(i, j)=\max(d(i,j), c(i), c(j))}, where \eqn{c(i)} is \eqn{d(i, k)} with \eqn{k} being the (\code{M}-1)-th nearest neighbour of \eqn{i}. This causes the "noise" and "boundary" points to be "pulled away" from each other. The Genie++ clustering algorithm (see \code{\link{gclust}}) with respect to the mutual reachability distance gains the ability to identify some observations as noise points. Note that the case \code{M} = 2 corresponds to the original distance, but we determine the 1-nearest neighbours separately as well, which is a bit suboptimal; you can file a feature request if this makes your data analysis tasks too slow. } \examples{ library("datasets") data("iris") X <- iris[1:4] tree <- mst(X) } \references{ Jarnik V., O jistem problemu minimalnim, \emph{Prace Moravske Prirodovedecke Spolecnosti} 6, 1930, 57-63. Olson C.F., Parallel algorithms for hierarchical clustering, \emph{Parallel Comput.} 21, 1995, 1313-1325.
Prim R., Shortest connection networks and some generalisations, \emph{Bell Syst. Tech. J.} 36, 1957, 1389-1401. Campello R.J.G.B., Moulavi D., Sander J., Density-based clustering based on hierarchical density estimates, \emph{Lecture Notes in Computer Science} 7819, 2013, 160-172, \doi{10.1007/978-3-642-37456-2_14}. } \seealso{ The official online manual of \pkg{genieclust} at \url{https://genieclust.gagolewski.com/} Gagolewski M., \pkg{genieclust}: Fast and robust hierarchical clustering, \emph{SoftwareX} 15:100722, 2021, \doi{10.1016/j.softx.2021.100722}. \code{\link{emst_mlpack}()} for a very fast alternative in case of (very) low-dimensional Euclidean spaces (and \code{M} = 1). } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors } genieclust/man/gclust.Rd0000644000176200001440000001733114650456561015000 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/gclust.R \name{gclust} \alias{gclust} \alias{gclust.default} \alias{gclust.dist} \alias{gclust.mst} \alias{genie} \alias{genie.default} \alias{genie.dist} \alias{genie.mst} \title{Hierarchical Clustering Algorithm Genie} \usage{ gclust(d, ...) \method{gclust}{default}( d, gini_threshold = 0.3, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), cast_float32 = TRUE, verbose = FALSE, ... ) \method{gclust}{dist}(d, gini_threshold = 0.3, verbose = FALSE, ...) \method{gclust}{mst}(d, gini_threshold = 0.3, verbose = FALSE, ...) genie(d, ...) \method{genie}{default}( d, k, gini_threshold = 0.3, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), M = 1L, postprocess = c("boundary", "none", "all"), detect_noise = M > 1L, cast_float32 = TRUE, verbose = FALSE, ... ) \method{genie}{dist}( d, k, gini_threshold = 0.3, M = 1L, postprocess = c("boundary", "none", "all"), detect_noise = M > 1L, verbose = FALSE, ... 
) \method{genie}{mst}( d, k, gini_threshold = 0.3, postprocess = c("boundary", "none", "all"), detect_noise = FALSE, verbose = FALSE, ... ) } \arguments{ \item{d}{a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns), an object of class \code{dist}, see \code{\link[stats]{dist}}, or an object of class \code{mst}, see \code{\link{mst}()}.} \item{...}{further arguments passed to other methods.} \item{gini_threshold}{threshold for the Genie correction, i.e., the Gini index of the cluster size distribution; a threshold of 1.0 disables the correction. Low thresholds highly penalise the formation of small clusters.} \item{distance}{metric used to compute the linkage, one of: \code{"euclidean"} (synonym: \code{"l2"}), \code{"manhattan"} (a.k.a. \code{"l1"} and \code{"cityblock"}), \code{"cosine"}.} \item{cast_float32}{logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster).} \item{verbose}{logical; whether to print diagnostic messages and progress information.} \item{k}{the desired number of clusters to detect; \code{k} = 1 with \code{M} > 1 acts as a noise point detector.} \item{M}{smoothing factor; \code{M} <= 2 gives the selected \code{distance}; otherwise, the mutual reachability distance is used.} \item{postprocess}{one of \code{"boundary"} (default), \code{"none"} or \code{"all"}; in effect only if \code{M} > 1. By default, only "boundary" points are merged with their nearest "core" points (a point is a boundary point if it is a noise point and is amongst its adjacent vertex's \code{M}-1 nearest neighbours).
To force a classical k-partition of a data set (with no notion of noise), choose "all".} \item{detect_noise}{whether the minimum spanning tree's leaves should be marked as noise points, defaults to \code{TRUE} if \code{M} > 1 for compatibility with HDBSCAN*.} } \value{ \code{gclust()} computes the whole clustering hierarchy; it returns a list of class \code{hclust}, see \code{\link[stats]{hclust}}. Use \code{\link[stats]{cutree}} to obtain an arbitrary k-partition. \code{genie()} returns a \code{k}-partition - a vector with elements in 1,...,k, whose i-th element denotes the i-th input point's cluster identifier. Missing values (\code{NA}) denote noise points (if \code{detect_noise} is \code{TRUE}). } \description{ A reimplementation of \emph{Genie} - a robust and outlier-resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016). The Genie algorithm is based on a minimum spanning tree (MST) of the pairwise distance graph of a given point set. Just like the single linkage, it consumes the edges of the MST in an increasing order of weights. However, it prevents the formation of clusters of highly imbalanced sizes; once the Gini index (see \code{\link{gini_index}()}) of the cluster size distribution rises above \code{gini_threshold}, a forced merge of a point group of the smallest size is performed. Its appealing simplicity goes hand in hand with its usability; Genie often outperforms other clustering approaches on benchmark data, such as \url{https://github.com/gagolews/clustering-benchmarks}. The clustering can now also be computed with respect to the mutual reachability distance (based, e.g., on the Euclidean metric), which is used in the definition of the HDBSCAN* algorithm (see Campello et al., 2013). If \code{M} > 1, then the mutual reachability distance \eqn{m(i,j)} with smoothing factor \code{M} is used instead of the chosen "raw" distance \eqn{d(i,j)}.
It holds \eqn{m(i,j)=\max(d(i,j), c(i), c(j))}, where \eqn{c(i)} is \eqn{d(i,k)} with \eqn{k} being the (\code{M}-1)-th nearest neighbour of \eqn{i}. This causes the "noise" and "boundary" points to be "pulled away" from each other. The Genie correction together with the smoothing factor \code{M} > 1 (note that \code{M} = 2 corresponds to the original distance) gives a robustified version of the HDBSCAN* algorithm that is able to detect a predefined number of clusters. Hence, it does not depend on the DBSCAN's somewhat magical \code{eps} parameter or the HDBSCAN's \code{min_cluster_size} one. } \details{ Note that, as in the case of all the distance-based methods, the standardisation of the input features is definitely worth giving a try. If \code{d} is a numeric matrix or an object of class \code{dist}, \code{\link{mst}()} will be called to compute an MST, which generally takes at most \eqn{O(n^2)} time (the algorithm we provide is parallelised; the environment variable \code{OMP_NUM_THREADS} controls the number of threads in use). However, see \code{\link{emst_mlpack}()} for a very fast alternative in the case of Euclidean spaces of (very) low dimensionality and \code{M} = 1. Given a minimum spanning tree, the algorithm runs in \eqn{O(n \sqrt{n})} time. Therefore, if you want to test different \code{gini_threshold}s (or \code{k}s), it is best to explicitly compute the MST first. According to the algorithm's original definition, the resulting partition tree (dendrogram) might violate the ultrametricity property (merges might occur at levels that are not increasing w.r.t. a between-cluster distance). \code{gclust()} automatically corrects departures from ultrametricity by applying \code{height = rev(cummin(rev(height)))}.
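The correction height = rev(cummin(rev(height))) quoted above replaces each merge height by the minimum over it and all later heights, which yields a nondecreasing sequence. A plain-Python equivalent (illustrative only; the function name is made up and this is not part of the package):

```python
def enforce_ultrametricity(height):
    # Equivalent of R's rev(cummin(rev(height))): out[i] becomes the
    # minimum of height[i..n-1], so the merge heights end up nondecreasing
    # and the dendrogram satisfies the ultrametricity property again.
    out = list(height)
    for i in range(len(out) - 2, -1, -1):
        out[i] = min(out[i], out[i + 1])
    return out
```

For instance, the non-monotone sequence (1, 3, 2, 5) becomes (1, 2, 2, 5): the offending merge at level 3 is lowered to the level of the later merge at 2.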
} \examples{ library("datasets") data("iris") X <- iris[1:4] h <- gclust(X) y_pred <- cutree(h, 3) y_test <- iris[,5] plot(iris[,2], iris[,3], col=y_pred, pch=as.integer(iris[,5]), asp=1, las=1) adjusted_rand_score(y_test, y_pred) pair_sets_index(y_test, y_pred) # Fast for low-dimensional Euclidean spaces: # h <- gclust(emst_mlpack(X)) } \references{ Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, \emph{Information Sciences} 363, 2016, 8-23, \doi{10.1016/j.ins.2016.05.003}. Campello R.J.G.B., Moulavi D., Sander J., Density-based clustering based on hierarchical density estimates, \emph{Lecture Notes in Computer Science} 7819, 2013, 160-172, \doi{10.1007/978-3-642-37456-2_14}. Gagolewski M., Cena A., Bartoszuk M., Brzozowski L., Clustering with minimum spanning trees: How good can it be?, \emph{Journal of Classification}, 2024, in press, \doi{10.1007/s00357-024-09483-1}. } \seealso{ The official online manual of \pkg{genieclust} at \url{https://genieclust.gagolewski.com/} Gagolewski M., \pkg{genieclust}: Fast and robust hierarchical clustering, \emph{SoftwareX} 15:100722, 2021, \doi{10.1016/j.softx.2021.100722}. \code{\link{mst}()} for the minimum spanning tree routines. \code{\link{adjusted_rand_score}()} (amongst others) for external cluster validity measures (partition similarity scores). } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors } genieclust/man/inequality.Rd0000644000176200001440000000651214503236340015646 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{inequality} \alias{inequality} \alias{gini_index} \alias{bonferroni_index} \alias{devergottini_index} \title{Inequality Measures} \usage{ gini_index(x) bonferroni_index(x) devergottini_index(x) } \arguments{ \item{x}{numeric vector of non-negative values} } \value{ The value of the inequality index, a number in \eqn{[0, 1]}. 
} \description{ \code{gini_index()} gives the normalised Gini index, \code{bonferroni_index()} implements the Bonferroni index, and \code{devergottini_index()} implements the De Vergottini index. } \description is sharpened below. \details{ These indices can be used to quantify the "inequality" of a numeric sample. They can be conceived as normalised measures of data dispersion. For constant vectors (perfect equity), the indices yield values of 0. Vectors with all elements but one equal to 0 (perfect inequality) are assigned scores of 1. They follow the Pigou-Dalton principle (are Schur-convex): setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0} and \eqn{x_i - h \geq x_j + h} (taking from the "rich" and giving to the "poor") decreases the inequality. These indices have applications in economics, amongst others. The Genie clustering algorithm uses the Gini index as a measure of the inequality of cluster sizes. The normalised Gini index is given by: \deqn{ G(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)} }{ (n-1) \sum_{i=1}^n x_i }. } The normalised Bonferroni index is given by: \deqn{ B(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n} (n-\sum_{j=1}^i \frac{n}{n-j+1}) x_{\sigma(n-i+1)} }{ (n-1) \sum_{i=1}^n x_i }. } The normalised De Vergottini index is given by: \deqn{ V(x_1,\dots,x_n) = \frac{1}{\sum_{i=2}^n \frac{1}{i}} \left( \frac{ \sum_{i=1}^n \left( \sum_{j=i}^{n} \frac{1}{j}\right) x_{\sigma(n-i+1)} }{\sum_{i=1}^{n} x_i} - 1 \right). } Here, \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}. Time complexity: \eqn{O(n)} for increasingly sorted data. Otherwise, the vector will be sorted first. } \examples{ gini_index(c(2, 2, 2, 2, 2)) # no inequality gini_index(c(0, 0, 10, 0, 0)) # one has it all gini_index(c(7, 0, 3, 0, 0)) # give to the poor, take away from the rich gini_index(c(6, 0, 3, 1, 0)) # (a.k.a.
Pigou-Dalton principle) bonferroni_index(c(2, 2, 2, 2, 2)) bonferroni_index(c(0, 0, 10, 0, 0)) bonferroni_index(c(7, 0, 3, 0, 0)) bonferroni_index(c(6, 0, 3, 1, 0)) devergottini_index(c(2, 2, 2, 2, 2)) devergottini_index(c(0, 0, 10, 0, 0)) devergottini_index(c(7, 0, 3, 0, 0)) devergottini_index(c(6, 0, 3, 1, 0)) } \references{ Bonferroni C., \emph{Elementi di Statistica Generale}, Libreria Seber, Firenze, 1930. Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, \emph{Information Sciences} 363, 2016, pp. 8-23. \doi{10.1016/j.ins.2016.05.003} Gini C., \emph{Variabilita e Mutabilita}, Tipografia di Paolo Cuppini, Bologna, 1912. } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors } \seealso{ The official online manual of \pkg{genieclust} at \url{https://genieclust.gagolewski.com/} Gagolewski M., \pkg{genieclust}: Fast and robust hierarchical clustering, \emph{SoftwareX} 15:100722, 2021, \doi{10.1016/j.softx.2021.100722}. 
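The normalised Gini index formula above can be transcribed almost verbatim; below is a plain-Python mirror of the R \code{gini_index()} for illustration only (the package's own implementation is compiled and runs in \eqn{O(n)} for pre-sorted data).

```python
# Illustrative Python mirror of the documented Gini formula,
# not the package's implementation.
def gini_index(x):
    """Normalised Gini index of a vector of non-negative values:
    0 for constant vectors, 1 when a single element holds everything."""
    n = len(x)
    s = sorted(x, reverse=True)  # s[i-1] is the i-th largest value
    num = sum((n - 2 * i + 1) * s[i - 1] for i in range(1, n + 1))
    return num / ((n - 1) * sum(s))
```

For instance, \code{gini_index([0, 0, 10, 0, 0])} yields 1 and \code{gini_index([2, 2, 2, 2, 2])} yields 0, matching the R examples above.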
} genieclust/man/cluster_validity.Rd0000644000176200001440000001111414650456561017056 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{cluster_validity} \alias{cluster_validity} \alias{calinski_harabasz_index} \alias{dunnowa_index} \alias{generalised_dunn_index} \alias{negated_ball_hall_index} \alias{negated_davies_bouldin_index} \alias{negated_wcss_index} \alias{silhouette_index} \alias{silhouette_w_index} \alias{wcnn_index} \title{Internal Cluster Validity Measures} \usage{ calinski_harabasz_index(X, y) dunnowa_index( X, y, M = 25L, owa_numerator = "SMin:5", owa_denominator = "Const" ) generalised_dunn_index(X, y, lowercase_d, uppercase_d) negated_ball_hall_index(X, y) negated_davies_bouldin_index(X, y) negated_wcss_index(X, y) silhouette_index(X, y) silhouette_w_index(X, y) wcnn_index(X, y, M = 25L) } \arguments{ \item{X}{numeric matrix with \code{n} rows and \code{d} columns, representing \code{n} points in a \code{d}-dimensional space} \item{y}{vector of \code{n} integer labels, representing a partition whose \emph{quality} is to be assessed; \code{y[i]} is the cluster ID of the \code{i}-th point, \code{X[i, ]}; \code{1 <= y[i] <= K}, where \code{K} is the number of clusters} \item{M}{number of nearest neighbours} \item{owa_numerator, owa_denominator}{single string specifying the OWA operators to use in the definition of the DuNN index; one of: \code{"Mean"}, \code{"Min"}, \code{"Max"}, \code{"Const"}, \code{"SMin:D"}, \code{"SMax:D"}, where \code{D} is an integer defining the degree of smoothness} \item{lowercase_d}{an integer between 1 and 5, denoting \eqn{d_1}, ..., \eqn{d_5} in the definition of the generalised Dunn (Bezdek-Pal) index (numerator: min, max, and mean pairwise intracluster distance, distance between cluster centroids, weighted point-centroid distance, respectively)} \item{uppercase_d}{an integer between 1 and 3, denoting \eqn{D_1}, ..., \eqn{D_3} in the definition of the
generalised Dunn (Bezdek-Pal) index (denominator: max and min pairwise intracluster distance, average point-centroid distance, respectively)} } \value{ A single numeric value (the more, the \emph{better}). } \description{ Implementation of a number of so-called cluster validity indices critically reviewed in (Gagolewski, Bartoszuk, Cena, 2021). See Section 2 therein and (Gagolewski, 2022) for the respective definitions. The greater the index value, the more \emph{valid} (whatever that means) the assessed partition. For consistency, the Ball-Hall and Davies-Bouldin indexes as well as the within-cluster sum of squares (WCSS) take negative values. } \examples{ X <- as.matrix(iris[,1:4]) X[,] <- jitter(X) # otherwise we get a non-unique solution y <- as.integer(iris[[5]]) calinski_harabasz_index(X, y) # good calinski_harabasz_index(X, sample(1:3, nrow(X), replace=TRUE)) # bad } \references{ Ball G.H., Hall D.J., \emph{ISODATA: A novel method of data analysis and pattern classification}, Technical report No. AD699616, Stanford Research Institute, 1965. Bezdek J., Pal N., Some new indexes of cluster validity, \emph{IEEE Transactions on Systems, Man, and Cybernetics, Part B} 28, 1998, 301-315, \doi{10.1109/3477.678624}. Calinski T., Harabasz J., A dendrite method for cluster analysis, \emph{Communications in Statistics} 3(1), 1974, 1-27, \doi{10.1080/03610927408827101}. Davies D.L., Bouldin D.W., A Cluster Separation Measure, \emph{IEEE Transactions on Pattern Analysis and Machine Intelligence} PAMI-1 (2), 1979, 224-227, \doi{10.1109/TPAMI.1979.4766909}. Dunn J.C., A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters, \emph{Journal of Cybernetics} 3(3), 1973, 32-57, \doi{10.1080/01969727308546046}. 
Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, \emph{Information Sciences} 581, 620-636, 2021, \doi{10.1016/j.ins.2021.10.004}; preprint: \url{https://raw.githubusercontent.com/gagolews/bibliography/master/preprints/2021cvi.pdf}. Gagolewski M., A Framework for Benchmarking Clustering Algorithms, \emph{SoftwareX} 20, 2022, 101270, \doi{10.1016/j.softx.2022.101270}, \url{https://clustering-benchmarks.gagolewski.com}. Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis, \emph{Computational and Applied Mathematics} 20, 1987, 53-65, \doi{10.1016/0377-0427(87)90125-7}. } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors } \seealso{ The official online manual of \pkg{genieclust} at \url{https://genieclust.gagolewski.com/} Gagolewski M., \pkg{genieclust}: Fast and robust hierarchical clustering, \emph{SoftwareX} 15:100722, 2021, \doi{10.1016/j.softx.2021.100722}. } genieclust/man/compare_partitions.Rd0000644000176200001440000001742114650457143017376 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{compare_partitions} \alias{compare_partitions} \alias{normalized_clustering_accuracy} \alias{normalized_pivoted_accuracy} \alias{pair_sets_index} \alias{adjusted_rand_score} \alias{rand_score} \alias{adjusted_fm_score} \alias{fm_score} \alias{mi_score} \alias{normalized_mi_score} \alias{adjusted_mi_score} \alias{normalized_confusion_matrix} \alias{normalizing_permutation} \title{External Cluster Validity Measures and Pairwise Partition Similarity Scores} \usage{ normalized_clustering_accuracy(x, y = NULL) normalized_pivoted_accuracy(x, y = NULL) pair_sets_index(x, y = NULL, simplified = FALSE, clipped = TRUE) adjusted_rand_score(x, y = NULL, clipped = FALSE) rand_score(x, y = NULL) adjusted_fm_score(x, y = NULL, clipped = FALSE) fm_score(x, y = NULL) mi_score(x, y = NULL) normalized_mi_score(x, y = 
NULL) adjusted_mi_score(x, y = NULL, clipped = FALSE) normalized_confusion_matrix(x, y = NULL) normalizing_permutation(x, y = NULL) } \arguments{ \item{x}{an integer vector of length n (or an object coercible to) representing a K-partition of an n-set (e.g., a reference partition), or a confusion matrix with K rows and L columns (see \code{\link{table}(x, y)})} \item{y}{an integer vector of length n (or an object coercible to) representing an L-partition of the same set (e.g., the output of a clustering algorithm we wish to compare with \code{x}), or NULL (if x is a K*L confusion matrix)} \item{simplified}{whether to assume E=1 in the definition of the pair sets index, i.e., use Eq. (20) in (Rezaei, Franti, 2016) instead of Eq. (18)} \item{clipped}{whether the result should be clipped to the unit interval, i.e., [0, 1]} } \value{ Each cluster validity measure is a single numeric value. \code{normalized_confusion_matrix()} returns a numeric matrix. \code{normalizing_permutation()} returns a vector of indices. } \description{ The functions described in this section quantify the similarity between two label vectors \code{x} and \code{y} which represent two partitions of a set of \eqn{n} elements into, respectively, \eqn{K} and \eqn{L} nonempty and pairwise disjoint subsets. For instance, \code{x} and \code{y} can represent two clusterings of a dataset with \eqn{n} observations specified by two vectors of labels. The functions described here can be used as external cluster validity measures, where we assume that \code{x} is a reference (ground-truth) partition whilst \code{y} is the vector of predicted cluster memberships. All indices except \code{normalized_clustering_accuracy()} can act as a pairwise partition similarity score: they are symmetric, i.e., \code{index(x, y) == index(y, x)}. Each index except \code{mi_score()} (which computes the mutual information score) outputs 1 given two identical partitions.
Note that partitions are always defined up to a permutation (bijection) of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition. } \details{ \code{normalized_clustering_accuracy()} (Gagolewski, 2023) is an asymmetric external cluster validity measure which assumes that the label vector \code{x} (or rows in the confusion matrix) represents the reference (ground truth) partition. It is an average proportion of correctly classified points in each cluster above the worst case scenario of uniform membership assignment, with cluster ID matching based on the solution to the maximal linear sum assignment problem (see \code{\link{normalized_confusion_matrix}}). It is given by: \eqn{\max_\sigma \frac{1}{K} \sum_{j=1}^K \frac{c_{\sigma(j), j}-c_{\sigma(j),\cdot}/K}{c_{\sigma(j),\cdot}-c_{\sigma(j),\cdot}/K}}, where \eqn{C} is a confusion matrix with \eqn{K} rows and \eqn{L} columns, \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}}, and \eqn{c_{i, \cdot}=c_{i, 1}+...+c_{i, L}} is the i-th row sum, under the assumption that \eqn{c_{i,j}=0} for \eqn{i>K} or \eqn{j>L} and \eqn{0/0=0}. \code{normalized_pivoted_accuracy()} is defined as \eqn{(\max_\sigma \sum_{j=1}^{\max(K,L)} c_{\sigma(j),j}/n-1/\max(K,L))/(1-1/\max(K,L))}, where \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}}, and \eqn{n} is the sum of all elements in \eqn{C}. For non-square matrices, missing rows/columns are assumed to be filled with 0s. \code{pair_sets_index()} (PSI) was introduced in (Rezaei, Franti, 2016). The simplified PSI assumes E=1 in the definition of the index, i.e., uses Eq. (20) in the said paper instead of Eq. (18). For non-square matrices, missing rows/columns are assumed to be filled with 0s.
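To make the normalised clustering accuracy formula above concrete, here is a brute-force Python sketch that maximises over all label permutations explicitly (factorial time, so small \eqn{K} and \eqn{L} only; the package solves a linear sum assignment problem instead, and the helper name \code{nca} is hypothetical):

```python
from itertools import permutations

# Brute-force illustration of the NCA definition; not the
# package's implementation. Assumes K >= 2.
def nca(C):
    """Normalised clustering accuracy from a K x L confusion matrix C
    (rows = reference clusters)."""
    K, L = len(C), len(C[0])
    m = max(K, L)
    # pad to an m x m matrix with zeros, as the definition assumes
    P = [[C[i][j] if i < K and j < L else 0 for j in range(m)]
         for i in range(m)]
    row = [sum(r) for r in P]
    best = -float("inf")
    for sigma in permutations(range(m)):
        # rows with a zero sum contribute 0 (the 0/0 = 0 convention)
        t = sum((P[sigma[j]][j] - row[sigma[j]] / K)
                / (row[sigma[j]] - row[sigma[j]] / K)
                for j in range(K) if row[sigma[j]] > 0)
        best = max(best, t / K)
    return best
```

A diagonal confusion matrix gives 1, whereas uniform membership assignment (all entries of a row equal) gives 0, matching the "above the worst case scenario" interpretation in the text.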
\code{rand_score()} gives the Rand score (the "probability" of agreement between the two partitions) and \code{adjusted_rand_score()} is its version corrected for chance, see (Hubert, Arabie, 1985): its expected value is 0 given two independent partitions. Due to the adjustment, the resulting index may be negative for some inputs. Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score and \code{adjusted_fm_score()} is its adjusted-for-chance version; see (Hubert, Arabie, 1985). \code{mi_score()}, \code{adjusted_mi_score()} and \code{normalized_mi_score()} are information-theoretic scores, based on mutual information, see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}} in (Vinh et al., 2010). \code{normalized_confusion_matrix()} computes the confusion matrix and permutes its rows and columns so that the sum of the elements of the main diagonal is the largest possible (by solving the maximal assignment problem). The function only accepts \eqn{K \leq L}. The reordering of the columns of a confusion matrix can be determined by calling \code{normalizing_permutation()}. Also note that the built-in \code{\link{table}()} determines the standard confusion matrix. } \examples{ y_true <- iris[[5]] y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster normalized_clustering_accuracy(y_true, y_pred) normalized_pivoted_accuracy(y_true, y_pred) pair_sets_index(y_true, y_pred) pair_sets_index(y_true, y_pred, simplified=TRUE) adjusted_rand_score(y_true, y_pred) rand_score(table(y_true, y_pred)) # the same adjusted_fm_score(y_true, y_pred) fm_score(y_true, y_pred) mi_score(y_true, y_pred) normalized_mi_score(y_true, y_pred) adjusted_mi_score(y_true, y_pred) normalized_confusion_matrix(y_true, y_pred) normalizing_permutation(y_true, y_pred) } \references{ Gagolewski M., A framework for benchmarking clustering algorithms, \emph{SoftwareX} 20, 2022, 101270, \doi{10.1016/j.softx.2022.101270}, \url{https://clustering-benchmarks.gagolewski.com}. 
Gagolewski M., Normalised clustering accuracy: An asymmetric external cluster validity measure, \emph{Journal of Classification}, 2024, in press, \doi{10.1007/s00357-024-09482-2}. Hubert L., Arabie P., Comparing partitions, \emph{Journal of Classification} 2(1), 1985, 193-218, esp. Eqs. (2) and (4). Meila M., Heckerman D., An experimental comparison of model-based clustering methods, \emph{Machine Learning} 42, 2001, 9-29, \doi{10.1023/A:1007648401407}. Rezaei M., Franti P., Set matching measures for external cluster validity, \emph{IEEE Transactions on Knowledge and Data Engineering} 28(8), 2016, 2173-2186. Steinley D., Properties of the Hubert-Arabie adjusted Rand index, \emph{Psychological Methods} 9(3), 2004, 386-396, \doi{10.1037/1082-989X.9.3.386}. Vinh N.X., Epps J., Bailey J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, \emph{Journal of Machine Learning Research} 11, 2010, 2837-2854. } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors } \seealso{ The official online manual of \pkg{genieclust} at \url{https://genieclust.gagolewski.com/} Gagolewski M., \pkg{genieclust}: Fast and robust hierarchical clustering, \emph{SoftwareX} 15:100722, 2021, \doi{10.1016/j.softx.2021.100722}.
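The adjusted Rand index described above admits a compact closed form based on pair counts from the confusion matrix; below is a self-contained Python mirror under the Hubert-Arabie (1985) adjustment, for illustration only (not the package's compiled implementation):

```python
from math import comb

# Illustrative Python mirror of the adjusted Rand index;
# not the package's implementation.
def adjusted_rand_score(x, y):
    """Adjusted-for-chance Rand index of two label vectors:
    expected value 0 for independent partitions, 1 for identical ones."""
    pairs = {}
    for a, b in zip(x, y):
        pairs[(a, b)] = pairs.get((a, b), 0) + 1
    n = len(x)
    sum_ij = sum(comb(c, 2) for c in pairs.values())  # agreeing pairs
    rows, cols = {}, {}
    for (a, b), c in pairs.items():
        rows[a] = rows.get(a, 0) + c
        cols[b] = cols.get(b, 0) + c
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)  # chance agreement
    return (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
```

Since only pair counts matter, relabelling either partition (e.g., (1, 1, 2) vs (4, 4, 9)) leaves the score unchanged, consistent with the "partitions are defined up to a permutation of labels" remark above.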
} genieclust/man/emst_mlpack.Rd0000644000176200001440000000335414300276713015765 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/mst.R \name{emst_mlpack} \alias{emst_mlpack} \title{Euclidean Minimum Spanning Tree} \usage{ emst_mlpack(X, leaf_size = 1, naive = FALSE, verbose = FALSE) } \arguments{ \item{X}{a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns)} \item{leaf_size}{size of leaves in the kd-tree, controls the trade-off between speed and memory consumption} \item{naive}{logical; whether to use the naive, quadratic-time algorithm} \item{verbose}{logical; whether to print diagnostic messages} } \value{ An object of class \code{mst}, see \code{\link{mst}()} for details. } \description{ Provides access to the implementation of the Dual-Tree Boruvka algorithm from the \code{mlpack} package (if available). It is based on kd-trees and is fast for (very) low-dimensional Euclidean spaces. For higher dimensional spaces (say, over 5 features) or other metrics, use the parallelised Prim-like algorithm implemented in \code{\link{mst}()}. } \references{ March W.B., Ram P., Gray A.G., Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications, \emph{Proc. ACM SIGKDD'10}, 2010, 603-611, \url{https://mlpack.org/papers/emst.pdf}. Curtin R.R., Edel M., Lozhnikov M., Mentekidis Y., Ghaisas S., Zhang S., mlpack 3: A fast, flexible machine learning library, \emph{Journal of Open Source Software} 3(26), 2018, 726. } \author{ \href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors } \seealso{ The official online manual of \pkg{genieclust} at \url{https://genieclust.gagolewski.com/} Gagolewski M., \pkg{genieclust}: Fast and robust hierarchical clustering, \emph{SoftwareX} 15:100722, 2021, \doi{10.1016/j.softx.2021.100722}. 
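For intuition about what the naive, quadratic-time alternative mentioned above computes, here is a Prim's-algorithm sketch of the Euclidean MST in plain Python; it is purely illustrative (the hypothetical helper \code{prim_mst} is far slower than \code{mst()} or the Dual-Tree Boruvka approach of \code{emst_mlpack()}):

```python
import math

# Naive O(n^2) Prim's algorithm; illustrative only.
def prim_mst(X):
    """Euclidean minimum spanning tree of points X;
    returns n-1 edges as (i, j, weight) tuples."""
    n = len(X)
    in_tree = [False] * n
    best = [math.inf] * n    # cheapest known distance from the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]),
                key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(X[u], X[v])
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges
```

Dual-tree methods reach near-linearithmic behaviour in low dimensions by pruning whole subtrees of a kd-tree at once instead of scanning all pairwise distances as above.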
} genieclust/man/genieclust-package.Rd0000644000176200001440000000216414661566023017225 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/genieclust-package.R \docType{package} \name{genieclust-package} \alias{genieclust} \alias{genieclust-package} \title{The Genie Hierarchical Clustering Algorithm (with Extras)} \description{ See \code{\link{genie}()} for more details. } \seealso{ The official online manual of \pkg{genieclust} at \url{https://genieclust.gagolewski.com/} Gagolewski M., \pkg{genieclust}: Fast and robust hierarchical clustering, \emph{SoftwareX} 15:100722, 2021, \doi{10.1016/j.softx.2021.100722}. Useful links: \itemize{ \item \url{https://genieclust.gagolewski.com/} \item \url{https://clustering-benchmarks.gagolewski.com/} \item \url{https://github.com/gagolews/genieclust} \item Report bugs at \url{https://github.com/gagolews/genieclust/issues} } } \author{ \strong{Maintainer}: Marek Gagolewski \email{marek@gagolewski.com} (\href{https://orcid.org/0000-0003-0637-6028}{ORCID}) [copyright holder] Other contributors: \itemize{ \item Maciej Bartoszuk [contributor] \item Anna Cena [contributor] \item Peter M. Larsen [contributor] } } \keyword{internal} genieclust/DESCRIPTION0000644000176200001440000000472614661666071014150 0ustar liggesusersPackage: genieclust Type: Package Title: Fast and Robust Hierarchical Clustering with Noise Points Detection Version: 1.1.6 Date: 2024-08-22 Authors@R: c( person("Marek", "Gagolewski", role = c("aut", "cre", "cph"), email = "marek@gagolewski.com", comment = c(ORCID = "0000-0003-0637-6028")), person("Maciej", "Bartoszuk", role = c("ctb")), person("Anna", "Cena", role = c("ctb")), person("Peter M.", "Larsen", role = c("ctb")) ) Description: A retake on the Genie algorithm (Gagolewski, 2021 ) - a robust hierarchical clustering method (Gagolewski, Bartoszuk, Cena, 2016 ). 
Now faster and more memory efficient; determining the whole hierarchy for datasets of 10M points in low dimensional Euclidean spaces or 100K points in high-dimensional ones takes only 1-2 minutes. Allows clustering with respect to mutual reachability distances so that it can act as a noise point detector or a robustified version of 'HDBSCAN*' (that is able to detect a predefined number of clusters and hence does not depend on the somewhat fragile 'eps' parameter). The package also features an implementation of inequality indices (the Gini and Bonferroni indices), external cluster validity measures (e.g., the normalised clustering accuracy and partition similarity scores such as the adjusted Rand, Fowlkes-Mallows, adjusted mutual information, and the pair sets index), and internal cluster validity indices (e.g., the Calinski-Harabasz, Davies-Bouldin, Ball-Hall, Silhouette, and generalised Dunn indices). See also the 'Python' version of 'genieclust' available on 'PyPI', which supports sparse data, more metrics, and even larger datasets. BugReports: https://github.com/gagolews/genieclust/issues URL: https://genieclust.gagolewski.com/, https://clustering-benchmarks.gagolewski.com/, https://github.com/gagolews/genieclust License: AGPL-3 Imports: Rcpp (>= 1.0.4), stats, utils Suggests: datasets, mlpack LinkingTo: Rcpp Encoding: UTF-8 SystemRequirements: OpenMP RoxygenNote: 7.3.2 NeedsCompilation: yes Packaged: 2024-08-22 08:18:47 UTC; gagolews Author: Marek Gagolewski [aut, cre, cph] (), Maciej Bartoszuk [ctb], Anna Cena [ctb], Peter M. Larsen [ctb] Maintainer: Marek Gagolewski Repository: CRAN Date/Publication: 2024-08-22 16:50:01 UTC