genieclust/0000755000176200001440000000000014040221650012406 5ustar liggesusersgenieclust/NAMESPACE0000644000176200001440000000130714040206352013630 0ustar liggesusers# Generated by roxygen2: do not edit by hand S3method(gclust,default) S3method(gclust,dist) S3method(gclust,mst) S3method(genie,default) S3method(genie,dist) S3method(genie,mst) S3method(mst,default) S3method(mst,dist) export(adjusted_fm_score) export(adjusted_mi_score) export(adjusted_rand_score) export(bonferroni_index) export(emst_mlpack) export(fm_score) export(gclust) export(genie) export(gini_index) export(mi_score) export(mst) export(normalized_accuracy) export(normalized_mi_score) export(pair_sets_index) export(rand_score) importFrom(Rcpp,evalCpp) importFrom(stats,cutree) importFrom(stats,dist) importFrom(stats,hclust) importFrom(utils,capture.output) useDynLib(genieclust, .registration=TRUE) genieclust/man/0000755000176200001440000000000013677266535013213 5ustar liggesusersgenieclust/man/emst_mlpack.Rd0000644000176200001440000000262014040203074015747 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/mst.R \name{emst_mlpack} \alias{emst_mlpack} \title{Euclidean Minimum Spanning Tree} \usage{ emst_mlpack(X, leaf_size = 1, naive = FALSE, verbose = FALSE) } \arguments{ \item{X}{a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns)} \item{leaf_size}{size of leaves in the kd-tree, controls the trade-off between speed and memory consumption} \item{naive}{logical; whether to use the naive, quadratic-time algorithm} \item{verbose}{logical; whether to print diagnostic messages} } \value{ An object of class \code{mst}, see \code{\link{mst}()} for details. } \description{ Provides access to the implementation of the Dual-Tree Borůvka algorithm from the \code{mlpack} package (if available). It is based on kd-trees and is fast for (very) low-dimensional Euclidean spaces. 
For higher dimensional spaces (say, over 5 features) or other metrics, use the parallelised Prim-like algorithm implemented in \code{\link{mst}()}. } \references{ March W.B., Ram P., Gray A.G., Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications, Proc. ACM SIGKDD'10 (2010) 603-611, \url{https://mlpack.org/papers/emst.pdf}. Curtin R.R., Edel M., Lozhnikov M., Mentekidis Y., Ghaisas S., Zhang S., mlpack 3: A fast, flexible machine learning library, Journal of Open Source Software 3(26), 726, 2018. } genieclust/man/genieclust-package.Rd0000644000176200001440000000053113677274320017223 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/genieclust-package.R \docType{package} \name{genieclust-package} \alias{genieclust-package} \alias{genieclust} \title{The Genie++ Hierarchical Clustering Algorithm (with Extras)} \description{ See \code{\link{genie}()} for more details. } \author{ Marek Gagolewski } genieclust/man/mst.Rd0000644000176200001440000001022613716211073014263 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/mst.R \name{mst} \alias{mst} \alias{mst.default} \alias{mst.dist} \title{Minimum Spanning Tree of the Pairwise Distance Graph} \usage{ mst(d, ...) \method{mst}{default}( d, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), M = 1L, cast_float32 = TRUE, verbose = FALSE, ... ) \method{mst}{dist}(d, M = 1L, verbose = FALSE, ...) } \arguments{ \item{d}{either a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns) or an object of class \code{dist}, see \code{\link[stats]{dist}}.} \item{...}{further arguments passed to or from other methods.} \item{distance}{metric used to compute the linkage, one of: \code{"euclidean"} (synonym: \code{"l2"}), \code{"manhattan"} (a.k.a. 
\code{"l1"} and \code{"cityblock"}), \code{"cosine"}.} \item{M}{smoothing factor; \code{M} = 1 gives the selected \code{distance}; otherwise, the mutual reachability distance is used.} \item{cast_float32}{logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster).} \item{verbose}{logical; whether to print diagnostic messages and progress information.} } \value{ Matrix of class \code{mst} with n-1 rows and 3 columns: \code{from}, \code{to} and \code{dist}. It holds \code{from} < \code{to}. Moreover, \code{dist} is sorted nondecreasingly. The i-th row gives the i-th edge of the MST. \code{(from[i], to[i])} defines the vertices (in 1,...,n) and \code{dist[i]} gives the weight, i.e., the distance between the corresponding points. The \code{method} attribute gives the name of the distance used. The \code{Labels} attribute gives the labels of all the input points. If \code{M} > 1, the \code{nn} attribute gives the indices of the \code{M}-1 nearest neighbours of each point. } \description{ A parallelised implementation of a Jarník (Prim/Dijkstra)-like algorithm for determining a(*) minimum spanning tree (MST) of a complete undirected graph representing a set of n points with weights given by a pairwise distance matrix. (*) Note that there might be multiple minimum trees spanning a given graph. } \details{ If \code{d} is a numeric matrix of size \eqn{n\times p}, the \eqn{n (n-1)/2} distances are computed on the fly, so that \eqn{O(n M)} memory is used. The algorithm is parallelised; set the \code{OMP_NUM_THREADS} environment variable (see \code{\link[base]{Sys.setenv}}) to control the number of threads used. Time complexity is \eqn{O(n^2)} for the method accepting an object of class \code{dist} and \eqn{O(p n^2)} otherwise. If \code{M} >= 2, then the mutual reachability distance \eqn{m(i,j)} with smoothing factor \code{M} (see Campello et al. 2015) is used instead of the chosen "raw" distance \eqn{d(i,j)}. 
It holds \eqn{m(i, j)=\max(d(i,j), c(i), c(j))}, where \eqn{c(i)} is \eqn{d(i, k)} with \eqn{k} being the (\code{M}-1)-th nearest neighbour of \eqn{i}. This causes "noise" and "boundary" points to be "pulled away" from each other. The Genie++ clustering algorithm (see \code{\link{gclust}}), when applied with respect to the mutual reachability distance, gains the ability to identify some observations as noise points. Note that the case \code{M} = 2 corresponds to the original distance, but we are determining the 1-nearest neighbours separately as well, which is a bit suboptimal; you can file a feature request if this makes your data analysis tasks too slow. } \examples{ library("datasets") data("iris") X <- iris[1:4] tree <- mst(X) } \references{ Jarník V., O jistém problému minimálním, Práce Moravské Přírodovědecké Společnosti 6 (1930) 57–63. Olson C.F., Parallel algorithms for hierarchical clustering, Parallel Comput. 21 (1995) 1313–1325. Prim R., Shortest connection networks and some generalisations, Bell Syst. Tech. J. 36 (1957) 1389–1401. Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1) (2015) 5:1–5:51. } \seealso{ \code{\link{emst_mlpack}()} for a very fast alternative in case of (very) low-dimensional Euclidean spaces (and \code{M} = 1). } genieclust/man/gclust.Rd0000644000176200001440000001622314002205033014750 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/gclust.R \name{gclust} \alias{gclust} \alias{gclust.default} \alias{gclust.dist} \alias{gclust.mst} \alias{genie} \alias{genie.default} \alias{genie.dist} \alias{genie.mst} \title{The Genie++ Hierarchical Clustering Algorithm} \usage{ gclust(d, ...) \method{gclust}{default}( d, gini_threshold = 0.3, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), cast_float32 = TRUE, verbose = FALSE, ... 
) \method{gclust}{dist}(d, gini_threshold = 0.3, verbose = FALSE, ...) \method{gclust}{mst}(d, gini_threshold = 0.3, verbose = FALSE, ...) genie(d, ...) \method{genie}{default}( d, k, gini_threshold = 0.3, distance = c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), M = 1L, postprocess = c("boundary", "none", "all"), detect_noise = M > 1L, cast_float32 = TRUE, verbose = FALSE, ... ) \method{genie}{dist}( d, k, gini_threshold = 0.3, M = 1L, postprocess = c("boundary", "none", "all"), detect_noise = M > 1L, verbose = FALSE, ... ) \method{genie}{mst}( d, k, gini_threshold = 0.3, postprocess = c("boundary", "none", "all"), detect_noise = FALSE, verbose = FALSE, ... ) } \arguments{ \item{d}{a numeric matrix (or an object coercible to one, e.g., a data frame with numeric-like columns), an object of class \code{dist} (see \code{\link[stats]{dist}}), or an object of class \code{mst} (see \code{\link{mst}()}).} \item{...}{further arguments passed to other methods.} \item{gini_threshold}{threshold for the Genie correction, i.e., the Gini index of the cluster size distribution; a threshold of 1.0 disables the correction. Low thresholds highly penalise the formation of small clusters.} \item{distance}{metric used to compute the linkage, one of: \code{"euclidean"} (synonym: \code{"l2"}), \code{"manhattan"} (a.k.a. \code{"l1"} and \code{"cityblock"}), \code{"cosine"}.} \item{cast_float32}{logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster).} \item{verbose}{logical; whether to print diagnostic messages and progress information.} \item{k}{the desired number of clusters to detect; \code{k} = 1 with \code{M} > 1 acts as a noise point detector.} \item{M}{smoothing factor; \code{M} <= 2 gives the selected \code{distance}; otherwise, the mutual reachability distance is used.} \item{postprocess}{one of \code{"boundary"} (default), \code{"none"} or \code{"all"}; in effect only if \code{M} > 1. 
By default, only "boundary" points are merged with their nearest "core" points (a point is a boundary point if it is a noise point that is amongst its adjacent vertex's \code{M}-1 nearest neighbours). To force a classical k-partition of a data set (with no notion of noise), choose "all".} \item{detect_noise}{whether the minimum spanning tree's leaves should be marked as noise points, defaults to \code{TRUE} if \code{M} > 1 for compatibility with HDBSCAN*.} } \value{ \code{gclust()} computes the whole clustering hierarchy; it returns a list of class \code{hclust}, see \code{\link[stats]{hclust}}. Use \code{\link[stats]{cutree}()} to obtain an arbitrary k-partition. \code{genie()} returns a \code{k}-partition - a vector with elements in 1,...,k, whose i-th element denotes the i-th input point's cluster identifier. Missing values (\code{NA}) denote noise points (if \code{detect_noise} is \code{TRUE}). } \description{ A reimplementation of \emph{Genie} - a robust and outlier-resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016). The Genie algorithm is based on a minimum spanning tree (MST) of the pairwise distance graph of a given point set. Just like single linkage, it consumes the edges of the MST in increasing order of weights. However, it prevents the formation of clusters of highly imbalanced sizes; once the Gini index (see \code{\link{gini_index}()}) of the cluster size distribution rises above \code{gini_threshold}, a forced merge of a point group of the smallest size is performed. Its appealing simplicity goes hand in hand with its usability; Genie often outperforms other clustering approaches on benchmark data, such as \url{https://github.com/gagolews/clustering_benchmarks_v1}. The clustering can now also be computed with respect to the mutual reachability distance (based, e.g., on the Euclidean metric), which is used in the definition of the HDBSCAN* algorithm (see Campello et al., 2015). 
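To make the mutual reachability distance mentioned above concrete, here is a minimal, self-contained C++ sketch for 1-dimensional data; the function names are ours and this is an illustration only, not the package's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Core distance c(i): the distance from point i to its (M-1)-th nearest
// neighbour (assumes M >= 2); brute force, for 1-dimensional data only.
double core_distance(const std::vector<double>& pts, std::size_t i, int M) {
    std::vector<double> d;
    for (std::size_t j = 0; j < pts.size(); ++j)
        if (j != i) d.push_back(std::fabs(pts[i] - pts[j]));
    std::sort(d.begin(), d.end());
    return d[M - 2];
}

// Mutual reachability distance: m(i,j) = max(d(i,j), c(i), c(j)).
double mutual_reachability(const std::vector<double>& pts,
                           std::size_t i, std::size_t j, int M) {
    double dij = std::fabs(pts[i] - pts[j]);
    return std::max(dij, std::max(core_distance(pts, i, M),
                                  core_distance(pts, j, M)));
}
```

For instance, for the points 0, 1, 2, 10 with \code{M} = 2, the outlier 10 keeps a mutual reachability distance of at least 8 to every other point, i.e., it is "pulled away" from the rest.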
If \code{M} > 1, then the mutual reachability distance \eqn{m(i,j)} with smoothing factor \code{M} is used instead of the chosen "raw" distance \eqn{d(i,j)}. It holds \eqn{m(i,j)=\max(d(i,j), c(i), c(j))}, where \eqn{c(i)} is \eqn{d(i,k)} with \eqn{k} being the (\code{M}-1)-th nearest neighbour of \eqn{i}. This causes "noise" and "boundary" points to be "pulled away" from each other. The Genie correction together with the smoothing factor \code{M} > 1 (note that \code{M} = 2 corresponds to the original distance) gives a robustified version of the HDBSCAN* algorithm that is able to detect a predefined number of clusters. Hence, it does not depend on DBSCAN's somewhat magical \code{eps} parameter or HDBSCAN's \code{min_cluster_size} one. } \details{ Note that, as in the case of all distance-based methods, the standardisation of the input features is definitely worth giving a try. If \code{d} is a numeric matrix or an object of class \code{dist}, \code{\link{mst}()} will be called to compute an MST, which generally takes at most \eqn{O(n^2)} time (the algorithm we provide is parallelised; the environment variable \code{OMP_NUM_THREADS} controls the number of threads in use). However, see \code{\link{emst_mlpack}()} for a very fast alternative in the case of Euclidean spaces of (very) low dimensionality and \code{M} = 1. Given a minimum spanning tree, the algorithm runs in \eqn{O(n \sqrt{n})} time. Therefore, if you want to test different \code{gini_threshold}s (or \code{k}s), it is best to explicitly compute the MST first. According to the algorithm's original definition, the resulting partition tree (dendrogram) might violate the ultrametricity property (merges might occur at levels that are not increasing w.r.t. a between-cluster distance). Departures from ultrametricity are corrected by applying \code{height = rev(cummin(rev(height)))}. 
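The \code{height = rev(cummin(rev(height)))} correction mentioned above amounts to a single backward pass that clips every merge height to the running minimum seen so far; a hypothetical C++ sketch of the same idea (ours, not the package's code):

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Equivalent of R's height = rev(cummin(rev(height))): scan the merge
// heights from the last one backwards, replacing each by the minimum of
// itself and everything to its right; the result is nondecreasing.
std::vector<double> fix_ultrametricity(std::vector<double> height) {
    double running_min = std::numeric_limits<double>::infinity();
    for (std::size_t i = height.size(); i-- > 0; ) {
        running_min = std::min(running_min, height[i]);
        height[i] = running_min;
    }
    return height;
}
```

For example, the merge heights (1, 3, 2, 5) become (1, 2, 2, 5).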
} \examples{ library("datasets") data("iris") X <- iris[1:4] h <- gclust(X) y_pred <- cutree(h, 3) y_test <- iris[,5] plot(iris[,2], iris[,3], col=y_pred, pch=as.integer(iris[,5]), asp=1, las=1) adjusted_rand_score(y_test, y_pred) pair_sets_index(y_test, y_pred) # Fast for low-dimensional Euclidean spaces: h <- gclust(emst_mlpack(X)) } \references{ Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, \emph{Information Sciences} 363, 2016, 8-23. Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1), 2015, 5:1–5:51. } \seealso{ \code{\link{mst}()} for the minimum spanning tree routines. \code{\link{adjusted_rand_score}()} (amongst others) for external cluster validity measures (partition similarity scores). } genieclust/man/comparing_partitions.Rd0000644000176200001440000001047214040205270017707 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{adjusted_rand_score} \alias{adjusted_rand_score} \alias{rand_score} \alias{adjusted_fm_score} \alias{fm_score} \alias{mi_score} \alias{normalized_mi_score} \alias{adjusted_mi_score} \alias{normalized_accuracy} \alias{pair_sets_index} \title{Pairwise Partition Similarity Scores} \usage{ adjusted_rand_score(x, y = NULL) rand_score(x, y = NULL) adjusted_fm_score(x, y = NULL) fm_score(x, y = NULL) mi_score(x, y = NULL) normalized_mi_score(x, y = NULL) adjusted_mi_score(x, y = NULL) normalized_accuracy(x, y = NULL) pair_sets_index(x, y = NULL) } \arguments{ \item{x}{an integer vector of length n (or an object coercible to one) representing a K-partition of an n-set, or a confusion matrix with K rows and L columns (see \code{table(x, y)})} \item{y}{an integer vector of length n (or an object coercible to one) representing an L-partition of the same set, or NULL (if x is a K*L 
confusion matrix)} } \value{ A single real value giving the similarity score. } \description{ Let \code{x} and \code{y} represent two partitions of a set of \eqn{n} elements into, respectively, \eqn{K} and \eqn{L} nonempty and pairwise disjoint subsets. For instance, these can be two clusterings of a dataset with \eqn{n} observations specified by two vectors of labels. The functions described in this section quantify the similarity between \code{x} and \code{y}. They can be used as external cluster validity measures, i.e., in the presence of reference (ground-truth) partitions. } \details{ Every index except \code{mi_score()} (which computes the mutual information score) outputs 1 given two identical partitions. Note that partitions are always defined up to a bijection of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) represent the same 2-partition. \code{rand_score()} gives the Rand score (the "probability" of agreement between the two partitions) and \code{adjusted_rand_score()} is its version corrected for chance, see (Hubert, Arabie, 1985), its expected value is 0.0 given two independent partitions. Due to the adjustment, the resulting index might also be negative for some inputs. Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score and \code{adjusted_fm_score()} is its adjusted-for-chance version, see (Hubert, Arabie, 1985). Note that both the (unadjusted) Rand and FM scores are bounded from below by \eqn{1/(K+1)} if \eqn{K=L}, hence their adjusted versions are preferred. \code{mi_score()}, \code{adjusted_mi_score()} and \code{normalized_mi_score()} are information-theoretic scores, based on mutual information, see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}} in (Vinh et al., 2010). 
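As an illustration of the pair-counting idea behind \code{rand_score()} described above, here is a naive \eqn{O(n^2)} C++ sketch (ours, for exposition only; it is not the package's actual implementation, which works on a confusion matrix):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rand score: the fraction of point pairs on which two partitions agree,
// i.e., pairs put together in both clusterings or kept apart in both.
double rand_score_naive(const std::vector<int>& x, const std::vector<int>& y) {
    std::size_t n = x.size();  // assumes x.size() == y.size() and n >= 2
    long long agree = 0, total = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            bool together_x = (x[i] == x[j]);
            bool together_y = (y[i] == y[j]);
            if (together_x == together_y) ++agree;
            ++total;
        }
    return double(agree) / double(total);
}
```

Note that the score is invariant to relabelling: (1, 1, 2, 1) versus (4, 4, 2, 4) yields 1, in line with the remark above that partitions are defined up to a bijection of the set of possible labels.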
\code{normalized_accuracy()} is defined as \eqn{(Accuracy(C_\sigma)-1/L)/(1-1/L)}, where \eqn{C_\sigma} is a version of the confusion matrix for given \code{x} and \code{y}, \eqn{K \leq L}, with columns permuted based on the solution to the Maximal Linear Sum Assignment Problem. \eqn{Accuracy(C_\sigma)} is sometimes referred to as Purity, e.g., in (Rendon et al. 2011). \code{pair_sets_index()} gives the Pair Sets Index (PSI) adjusted for chance (Rezaei, Franti, 2016), \eqn{K \leq L}. Pairing is based on the solution to the Linear Sum Assignment Problem of a transformed version of the confusion matrix. } \examples{ y_true <- iris[[5]] y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster adjusted_rand_score(y_true, y_pred) rand_score(table(y_true, y_pred)) # the same adjusted_fm_score(y_true, y_pred) fm_score(y_true, y_pred) mi_score(y_true, y_pred) normalized_mi_score(y_true, y_pred) adjusted_mi_score(y_true, y_pred) normalized_accuracy(y_true, y_pred) pair_sets_index(y_true, y_pred) } \references{ Hubert L., Arabie P., Comparing Partitions, Journal of Classification 2(1), 1985, 193-218, esp. Eqs. (2) and (4). Rendon E., Abundez I., Arizmendi A., Quiroz E.M., Internal versus external cluster validation indexes, International Journal of Computers and Communications 5(1), 2011, 27-34. Rezaei M., Franti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Mining 28(8), 2016, 2173-2186. Vinh N.X., Epps J., Bailey J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, Journal of Machine Learning Research 11, 2010, 2837-2854. 
} genieclust/man/inequity.Rd0000644000176200001440000000515613703563365015337 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R \name{gini_index} \alias{gini_index} \alias{bonferroni_index} \title{Inequity (Inequality) Measures} \usage{ gini_index(x) bonferroni_index(x) } \arguments{ \item{x}{numeric vector of non-negative values} } \value{ The value of the inequity index, a number in \eqn{[0, 1]}. } \description{ \code{gini_index()} gives the normalised Gini index and \code{bonferroni_index()} implements the Bonferroni index. } \details{ Both indices can be used to quantify the "inequity" of a numeric sample. They can be perceived as measures of data dispersion. For constant vectors (perfect equity), the indices yield values of 0. Vectors with all elements but one equal to 0 (perfect inequity) are assigned scores of 1. Both indices follow the Pigou-Dalton principle (are Schur-convex): setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0} and \eqn{x_i - h \geq x_j + h} (taking from the "rich" and giving to the "poor") decreases the inequity. These indices have applications in economics, amongst others. The Genie clustering algorithm uses the Gini index as a measure of the inequality of cluster sizes. The normalised Gini index is given by: \deqn{ G(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j| }{ (n-1) \sum_{i=1}^n x_i }. } The normalised Bonferroni index is given by: \deqn{ B(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n} (n-\sum_{j=1}^i \frac{n}{n-j+1}) x_{\sigma(n-i+1)} }{ (n-1) \sum_{i=1}^n x_i }. } Time complexity: \eqn{O(n)} for data sorted increasingly; otherwise, the input vector will be sorted first. In particular, for ordered inputs, it holds: \deqn{ G(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)} }{ (n-1) \sum_{i=1}^n x_i }, } where \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}. 
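The two Gini formulas above are equivalent; a small C++ cross-check (an illustrative sketch under our own function names, not the package's code) implements both the quadratic-time pairwise-sum definition and the sort-based variant:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Normalised Gini index via the pairwise-sum definition: O(n^2).
// Assumes n >= 2 and a positive element sum.
double gini_pairwise(const std::vector<double>& x) {
    std::size_t n = x.size();
    double num = 0.0, sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += x[i];
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            num += std::fabs(x[i] - x[j]);
    return num / ((n - 1) * sum);
}

// Equivalent formula for ordered inputs: the i-th largest value gets
// weight (n - 2i + 1); sorting dominates, hence O(n log n) overall.
double gini_sorted(std::vector<double> x) {
    std::size_t n = x.size();
    std::sort(x.begin(), x.end());  // increasing order; x[n-i] = i-th largest
    double num = 0.0, sum = 0.0;
    for (std::size_t i = 1; i <= n; ++i) {
        num += (double(n) - 2.0 * i + 1.0) * x[n - i];
        sum += x[i - 1];
    }
    return num / ((n - 1) * sum);
}
```

Both versions agree, e.g., on (7, 0, 3, 0, 0), where the index equals 0.85.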
} \examples{ gini_index(c(2, 2, 2, 2, 2)) # no inequality gini_index(c(0, 0, 10, 0, 0)) # one has it all gini_index(c(7, 0, 3, 0, 0)) # give to the poor, take away from the rich gini_index(c(6, 0, 3, 1, 0)) # (a.k.a. Pigou-Dalton principle) bonferroni_index(c(2, 2, 2, 2, 2)) bonferroni_index(c(0, 0, 10, 0, 0)) bonferroni_index(c(7, 0, 3, 0, 0)) bonferroni_index(c(6, 0, 3, 1, 0)) } \references{ Bonferroni C., Elementi di Statistica Generale, Libreria Seber, Firenze, 1930. Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003 Gini C., Variabilita e Mutabilita, Tipografia di Paolo Cuppini, Bologna, 1912. } genieclust/DESCRIPTION0000644000176200001440000000420114040221650014111 0ustar liggesusersPackage: genieclust Type: Package Title: The Genie++ Hierarchical Clustering Algorithm with Noise Points Detection Version: 1.0.0 Date: 2021-04-22 Authors@R: c( person("Marek", "Gagolewski", role = c("aut", "cre", "cph"), email = "marek@gagolewski.com", comment = c(ORCID = "0000-0003-0637-6028")), person("Maciej", "Bartoszuk", role = c("ctb")), person("Anna", "Cena", role = c("ctb")), person("Peter M.", "Larsen", role = c("ctb")) ) Description: A retake on the Genie algorithm - a robust hierarchical clustering method (Gagolewski, Bartoszuk, Cena, 2016). Now faster and more memory efficient; determining the whole hierarchy for datasets of 10M points in low dimensional Euclidean spaces or 100K points in high-dimensional ones takes only 1-2 minutes. Allows clustering with respect to mutual reachability distances so that it can act as a noise point detector or a robustified version of 'HDBSCAN*' (that is able to detect a predefined number of clusters and hence does not depend on the somewhat fragile 'eps' parameter). 
The package also features an implementation of economic inequity indices (the Gini, Bonferroni index) and external cluster validity measures (partition similarity scores; e.g., the adjusted Rand, Fowlkes-Mallows, adjusted mutual information, pair sets index). See also the 'Python' version of 'genieclust' available on 'PyPI', which supports sparse data, more metrics, and even larger datasets. BugReports: https://github.com/gagolews/genieclust/issues URL: https://genieclust.gagolewski.com/ License: AGPL-3 Imports: Rcpp (>= 1.0.4), stats, utils Suggests: datasets, mlpack LinkingTo: Rcpp Encoding: UTF-8 SystemRequirements: OpenMP, C++11 RoxygenNote: 7.1.1 NeedsCompilation: yes Packaged: 2021-04-22 05:43:20 UTC; gagolews Author: Marek Gagolewski [aut, cre, cph] (), Maciej Bartoszuk [ctb], Anna Cena [ctb], Peter M. Larsen [ctb] Maintainer: Marek Gagolewski Repository: CRAN Date/Publication: 2021-04-22 07:20:08 UTC genieclust/src/0000755000176200001440000000000014040206370013177 5ustar liggesusersgenieclust/src/c_mst.h0000644000176200001440000004174114002205527014465 0ustar liggesusers/* Minimum Spanning Tree Algorithms: * a. Prim-Jarník's for Complete Undirected Graphs, * b. Kruskal's for k-NN graphs. * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
*/ #ifndef __c_mst_h #define __c_mst_h #include "c_common.h" #include <vector> #include <algorithm> // #include // #include #include "c_argfuns.h" #include "c_disjoint_sets.h" #include "c_distance.h" #ifdef _OPENMP void Comp_set_num_threads(ssize_t n_threads) { if (n_threads <= 0) n_threads = omp_get_max_threads(); omp_set_num_threads(n_threads); } #else void Comp_set_num_threads(ssize_t /*n_threads*/) { ; } #endif /*! Represents an undirected edge in a weighted graph. * Main purpose: a comparer used to sort MST edges w.r.t. increasing weights */ template <class T> class CMstTriple { public: ssize_t i1; //!< first vertex defining an edge ssize_t i2; //!< second vertex defining an edge T d; //!< edge weight CMstTriple() {} CMstTriple(ssize_t i1, ssize_t i2, T d, bool order=true) { this->d = d; if (!order || (i1 < i2)) { this->i1 = i1; this->i2 = i2; } else { this->i1 = i2; this->i2 = i1; } } bool operator<(const CMstTriple& other) const { if (d == other.d) { if (i1 == other.i1) return i2 < other.i2; else return i1 < other.i1; } else return d < other.d; } }; /*! Computes a minimum spanning forest of a near-neighbour * graph using Kruskal's algorithm, and orders * its edges w.r.t. increasing weights. * * It is assumed that each point can have a different number of nearest * neighbours determined and hence the input graph is given in a "list"-like * form. * * Note that, in general, an MST of the (<=k)-nearest neighbour graph * might not be equal to the MST of the complete Pairwise Distances Graph. * * * @param nns [in/out] a c_contiguous array, shape (c,), of CMstTriple elements * defining the near-neighbour graphs. Loops are ignored. * The array is sorted in-place. * @param c number of elements in `nns`. 
* @param n number of nodes in the graph * @param mst_dist [out] c_contiguous vector of length n-1, gives weights of the * resulting MST edges in nondecreasing order; * refer to the function's return value for the actual number * of edges generated (if this is < n-1, the object is padded with INFTY) * @param mst_ind [out] c_contiguous matrix of size (n-1)*2, defining the edges * corresponding to mst_d, with mst_i[j,0] <= mst_i[j,1] for all j; * refer to the function's return value for the actual number * of edges generated (if this is < n-1, the object is padded with -1) * @param verbose output diagnostic/progress messages? * * @return number of edges in the minimal spanning forest */ // template // ssize_t Cmst_from_nn_list(CMstTriple* nns, ssize_t c, // ssize_t n, T* mst_dist, ssize_t* mst_ind, bool verbose=false) // { // if (n <= 0) throw std::domain_error("n <= 0"); // if (c <= 0) throw std::domain_error("c <= 0"); // // if (verbose) // GENIECLUST_PRINT_int("[genieclust] Computing the MST... %3d%%", 0); // // std::sort(nns, nns+c); // unstable sort (do we need stable here?) 
// // ssize_t triple_cur = 0; // ssize_t mst_edge_cur = 0; // // CDisjointSets ds(n); // while (mst_edge_cur < n-1) { // if (triple_cur == c) { // // The input graph is not connected (we have a forest) // ssize_t ret = mst_edge_cur; // while (mst_edge_cur < n-1) { // mst_ind[2*mst_edge_cur+0] = -1; // mst_ind[2*mst_edge_cur+1] = -1; // mst_dist[mst_edge_cur] = INFTY; // mst_edge_cur++; // } // if (verbose) // GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1)); // return ret; // } // // ssize_t u = nns[triple_cur].i1; // ssize_t v = nns[triple_cur].i2; // T d = nns[triple_cur].d; // triple_cur++; // // if (u > v) std::swap(u, v); // assure u < v // if (u < 0 || ds.find(u) == ds.find(v)) // continue; // // mst_ind[2*mst_edge_cur+0] = u; // mst_ind[2*mst_edge_cur+1] = v; // mst_dist[mst_edge_cur] = d; // // GENIECLUST_ASSERT(mst_edge_cur == 0 || mst_dist[mst_edge_cur] >= mst_dist[mst_edge_cur-1]); // // ds.merge(u, v); // mst_edge_cur++; // // // if (verbose) // GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1)); // // #if GENIECLUST_R // Rcpp::checkUserInterrupt(); // #elif GENIECLUST_PYTHON // if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); // #endif // } // // if (verbose) GENIECLUST_PRINT("\b\b\b\bdone.\n"); // // return mst_edge_cur; // } /*! Computes a minimum spanning forest of a (<=k)-nearest neighbour * (i.e., one that consists of 1-, 2-, ..., k-neighbours = the first k * nearest neighbours) graph using Kruskal's algorithm, and orders * its edges w.r.t. increasing weights. * * Note that, in general, an MST of the (<=k)-nearest neighbour graph * might not be equal to the MST of the complete Pairwise Distances Graph. * * It is assumed that each query point is not its own neighbour. 
* * @param dist a c_contiguous array, shape (n,k), * dist[i,j] gives the weight of the (undirected) edge {i, ind[i,j]} * @param ind a c_contiguous array, shape (n,k), * (undirected) edge definition, interpreted as {i, ind[i,j]}; * negative indices as well as those such that ind[i,j]==i are ignored * @param d_core "core" distance (or NULL); * if not NULL then the distance between 2 points will be * d(i, ind[i,j]) = max(d(i, ind[i,j]), d_core[i], d_core[ind[i,j]]) * @param n number of nodes * @param k minimal degree of all the nodes * @param mst_dist [out] c_contiguous vector of length n-1, gives weights of the * resulting MST edges in nondecreasing order; * refer to the function's return value for the actual number * of edges generated (if this is < n-1, the object is padded with INFTY) * @param mst_ind [out] c_contiguous matrix of size (n-1)*2, defining the edges * corresponding to mst_d, with mst_i[j,0] <= mst_i[j,1] for all j; * refer to the function's return value for the actual number * of edges generated (if this is < n-1, the object is padded with -1) * @param maybe_inexact [out] true indicates that k should be increased to * guarantee that the resulting tree would be the same if a complete * pairwise distance graph was given. * @param verbose output diagnostic/progress messages? * * @return number of edges in the minimal spanning forest */ template ssize_t Cmst_from_nn( const T* dist, const ssize_t* ind, const T* d_core, ssize_t n, ssize_t k, T* mst_dist, ssize_t* mst_ind, bool* maybe_inexact, bool verbose=false) { if (n <= 0) throw std::domain_error("n <= 0"); if (k <= 0) throw std::domain_error("k <= 0"); if (k >= n) throw std::domain_error("k >= n"); ssize_t nk = n*k; if (verbose) GENIECLUST_PRINT_int("[genieclust] Computing the MST... 
%3d%%", 0); std::vector< CMstTriple > nns(nk); ssize_t c = 0; for (ssize_t i = 0; i < n; ++i) { for (ssize_t j = 0; j < k; ++j) { ssize_t i2 = ind[k*i+j]; if (i2 >= 0 && i2 != i) { double d = dist[k*i+j]; if (d_core) { // d(i, i2) = max(d(i,i2), d_core[i], d_core[i2]) if (d < d_core[i]) d = d_core[i]; if (d < d_core[i2]) d = d_core[i2]; } nns[c++] = CMstTriple(i, i2, d, true); } } } std::stable_sort(nns.data(), nns.data()+c); ssize_t triple_cur = 0; ssize_t mst_edge_cur = 0; CDisjointSets ds(n); while (mst_edge_cur < n-1) { if (triple_cur == c) { // The input graph is not connected (we have a forest) ssize_t ret = mst_edge_cur; while (mst_edge_cur < n-1) { mst_ind[2*mst_edge_cur+0] = -1; mst_ind[2*mst_edge_cur+1] = -1; mst_dist[mst_edge_cur] = INFTY; mst_edge_cur++; } if (verbose) GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1)); return ret; } ssize_t u = nns[triple_cur].i1; ssize_t v = nns[triple_cur].i2; T d = nns[triple_cur].d; triple_cur++; if (ds.find(u) == ds.find(v)) continue; mst_ind[2*mst_edge_cur+0] = u; mst_ind[2*mst_edge_cur+1] = v; mst_dist[mst_edge_cur] = d; GENIECLUST_ASSERT(mst_edge_cur == 0 || mst_dist[mst_edge_cur] >= mst_dist[mst_edge_cur-1]); ds.merge(u, v); mst_edge_cur++; if (verbose) GENIECLUST_PRINT_int("\b\b\b\b%3d%%", mst_edge_cur*100/(n-1)); #if GENIECLUST_R Rcpp::checkUserInterrupt(); #elif GENIECLUST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif } if (verbose) GENIECLUST_PRINT("\b\b\b\bdone.\n"); return mst_edge_cur; } /*! Determine the first k nearest neighbours of each point. * * Exactly n*(n-1)/2 distance computations are performed. * * It is assumed that each query point is not its own neighbour. 
* * Worst-case time complexity: O(n*(n-1)/2*d*k) * * * @param D a callable CDistance object such that a call to * D(j, M, ssize_t l) returns an n-ary array * with the distances from the j-th point to l points whose indices * are given in array M * @param n number of points * @param k number of nearest neighbours, * @param dist [out] a c_contiguous array, shape (n,k), * dist[i,j] gives the weight of the (undirected) edge {i, ind[i,j]} * @param ind [out] a c_contiguous array, shape (n,k), * (undirected) edge definition, interpreted as {i, ind[i,j]} * @param verbose output diagnostic/progress messages? */ template void Cknn_from_complete(CDistance* D, ssize_t n, ssize_t k, T* dist, ssize_t* ind, bool verbose=false) { if (n <= 0) throw std::domain_error("n <= 0"); if (k <= 0) throw std::domain_error("k <= 0"); if (k >= n) throw std::domain_error("k >= n"); if (verbose) GENIECLUST_PRINT_int("[genieclust] Computing the K-nn graph... %3d%%", 0); for (ssize_t i=0; i M(n); for (ssize_t i=0; i 0 && dij[j] < dist[i*k+l-1]) { dist[i*k+l] = dist[i*k+l-1]; ind[i*k+l] = ind[i*k+l-1]; l -= 1; } dist[i*k+l] = dij[j]; ind[i*k+l] = j; } if (dij[j] < dist[j*k+k-1]) { // i might be amongst k-NNs of j ssize_t l = k-1; while (l > 0 && dij[j] < dist[j*k+l-1]) { dist[j*k+l] = dist[j*k+l-1]; ind[j*k+l] = ind[j*k+l-1]; l -= 1; } dist[j*k+l] = dij[j]; ind[j*k+l] = i; } } if (verbose) GENIECLUST_PRINT_int("\b\b\b\b%3d%%", (n-1+n-i-1)*(i+1)*100/n/(n-1)); #if GENIECLUST_R Rcpp::checkUserInterrupt(); #elif GENIECLUST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif } if (verbose) GENIECLUST_PRINT("\b\b\b\bdone.\n"); } /*! A Jarník (Prim/Dijkstra)-like algorithm for determining * a(*) minimum spanning tree (MST) of a complete undirected graph * with weights given by, e.g., a symmetric n*n matrix. * * However, the distances can be computed on the fly, so that O(n) memory is used. * * (*) Note that there might be multiple minimum trees spanning a given graph. 
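 *
 * For instance (an editorial example), three points with pairwise
 * distances d(0,1)=d(0,2)=d(1,2)=1 admit three distinct minimum
 * spanning trees, each of total weight 2; the algorithm returns
 * one of them, depending on how ties are broken.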
* * * References: * ---------- * * M. Gagolewski, M. Bartoszuk, A. Cena, * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, * Information Sciences 363 (2016) 8–23. * * V. Jarník, O jistém problému minimálním, * Práce Moravské Přírodovědecké Společnosti 6 (1930) 57–63. * * C.F. Olson, Parallel algorithms for hierarchical clustering, * Parallel Comput. 21 (1995) 1313–1325. * * R. Prim, Shortest connection networks and some generalisations, * Bell Syst. Tech. J. 36 (1957) 1389–1401. * * * @param D a callable CDistance object such that a call to * D(j, M, ssize_t k) returns an n-ary array * with the distances from the j-th point to k points whose indices * are given in array M * @param n number of points * @param mst_d [out] vector of length n-1, gives weights of the * resulting MST edges in nondecreasing order * @param mst_i [out] vector of length 2*(n-1), representing * a c_contiguous array of shape (n-1,2), defining the edges * corresponding to mst_d, with mst_i[j,0] < mst_i[j,1] for all j * @param verbose output diagnostic/progress messages? */ template void Cmst_from_complete(CDistance* D, ssize_t n, T* mst_dist, ssize_t* mst_ind, bool verbose=false) { std::vector Dnn(n, INFTY); std::vector Fnn(n); std::vector M(n); std::vector< CMstTriple > res(n-1); for (ssize_t i=0; i 0); GENIECLUST_ASSERT(Fnn[bestj] != bestj); lastj = bestj; // start from bestj next time // M[bestjpos] = M[n-i-1]; // don't visit bestj again // (#62) new version: keep M sorted (more CPU cache-friendly): for (ssize_t j=bestjpos; j(Fnn[bestj], bestj, Dnn[bestj], true); if (verbose) GENIECLUST_PRINT_int("\b\b\b\b%3d%%", (n-1+n-i-1)*(i+1)*100/n/(n-1)); #if GENIECLUST_R Rcpp::checkUserInterrupt(); #elif GENIECLUST_PYTHON if (PyErr_CheckSignals() != 0) throw std::runtime_error("signal caught"); #endif } // sort the resulting MST edges in increasing order w.r.t. 
d std::sort(res.begin(), res.end()); for (ssize_t i=0; i * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_distance_h #define __c_distance_h #include "c_common.h" #include #include #ifdef _OPENMP #include #endif template inline T square(T x) { return x*x; } /*! Abstract base class for all distances */ template struct CDistance { virtual ~CDistance() {} /*! * @param i point index, 0<=i struct CDistancePrecomputedMatrix : public CDistance { const T* dist; ssize_t n; /*! * @param dist n*n c_contiguous array, dist[i,j] is the distance between * the i-th and the j-th point, the matrix is symmetric * @param n number of points */ CDistancePrecomputedMatrix(const T* dist, ssize_t n) { this->n = n; this->dist = dist; } CDistancePrecomputedMatrix() : CDistancePrecomputedMatrix(NULL, 0) { } virtual const T* operator()(ssize_t i, const ssize_t* /*M*/, ssize_t /*k*/) { return &this->dist[i*n]; // the i-th row of dist } }; /*! A class to "compute" the distances from the i-th point * to all n points based on a pre-computed a vector-form * c_contiguous distance vector. */ template struct CDistancePrecomputedVector : public CDistance { const T* dist; ssize_t n; std::vector buf; /*! 
* @param dist n*(n-1)/2 c_contiguous vector, dist[ i*n - i*(i+1)/2 + w-i-1 ] * where w is the distance between the i-th and the w-th point * @param n number of points */ CDistancePrecomputedVector(const T* dist, ssize_t n) : buf(n) { this->n = n; this->dist = dist; } CDistancePrecomputedVector() : CDistancePrecomputedVector(NULL, 0) { } virtual const T* operator()(ssize_t i, const ssize_t* M, ssize_t k) { T* __buf = buf.data(); for (ssize_t j=0; j struct CDistanceEuclidean : public CDistance { const T* X; ssize_t n; ssize_t d; std::vector buf; /*! * @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceEuclidean(const T* X, ssize_t n, ssize_t d) : buf(n) { this->n = n; this->d = d; this->X = X; } CDistanceEuclidean() : CDistanceEuclidean(NULL, 0, 0) { } virtual const T* operator()(ssize_t i, const ssize_t* M, ssize_t k) { T* __buf = buf.data(); const T* x = X+d*i; #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (ssize_t j=0; j struct CDistanceEuclideanSquared : public CDistance { const T* X; ssize_t n; ssize_t d; std::vector buf; /*! * @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceEuclideanSquared(const T* X, ssize_t n, ssize_t d) : buf(n) { this->n = n; this->d = d; this->X = X; } CDistanceEuclideanSquared() : CDistanceEuclideanSquared(NULL, 0, 0) { } virtual const T* operator()(ssize_t i, const ssize_t* M, ssize_t k) { T* __buf = buf.data(); const T* x = X+d*i; #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (ssize_t j=0; j struct CDistanceManhattan : public CDistance { const T* X; ssize_t n; ssize_t d; std::vector buf; /*! 
* @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceManhattan(const T* X, ssize_t n, ssize_t d) : buf(n) { this->n = n; this->d = d; this->X = X; } CDistanceManhattan() : CDistanceManhattan(NULL, 0, 0) { } virtual const T* operator()(ssize_t i, const ssize_t* M, ssize_t k) { T* __buf = buf.data(); #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (ssize_t j=0; j=0 && w struct CDistanceCosine : public CDistance { const T* X; ssize_t n; ssize_t d; std::vector buf; std::vector norm; /*! * @param X n*d c_contiguous array * @param n number of points * @param d dimensionality */ CDistanceCosine(const T* X, ssize_t n, ssize_t d) : buf(n), norm(n) { this->n = n; this->d = d; this->X = X; T* __norm = norm.data(); #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (ssize_t i=0; i=0&&w struct CDistanceMutualReachability : public CDistance { ssize_t n; CDistance* d_pairwise; std::vector buf; std::vector d_core; CDistanceMutualReachability(const T* _d_core, ssize_t n, CDistance* d_pairwise) : buf(n), d_core(_d_core, _d_core+n) { this->n = n; this->d_pairwise = d_pairwise; } CDistanceMutualReachability() : CDistanceMutualReachability(NULL, 0, NULL) { } virtual const T* operator()(ssize_t i, const ssize_t* M, ssize_t k) { // pragma omp parallel for inside:: const T* d = (*d_pairwise)(i, M, k); T* __buf = buf.data(); #ifdef _OPENMP #pragma omp parallel for schedule(static) #endif for (ssize_t j=0; j __buf[w]) __buf[w] = d_core[i]; if (d_core[w] > __buf[w]) __buf[w] = d_core[w]; } } return __buf; } }; #endif genieclust/src/c_common.h0000644000176200001440000000344313774712421015162 0ustar liggesusers/* Common functions, macros, includes * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. 
* This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_common_h #define __c_common_h #ifdef Py_PYTHON_H #define GENIECLUST_PYTHON 1 #endif #ifdef GENIECLUST_R #undef GENIECLUST_R #define GENIECLUST_R 1 #endif #include #include #include #ifdef _OPENMP #include #define OPENMP_ENABLED 1 #else #define OPENMP_ENABLED 0 #endif #ifndef GENIECLUST_ASSERT #define __GENIECLUST_STR(x) #x #define GENIECLUST_STR(x) __GENIECLUST_STR(x) #define GENIECLUST_ASSERT(EXPR) { if (!(EXPR)) \ throw std::runtime_error( "genieclust: Assertion " #EXPR " failed in "\ __FILE__ ":" GENIECLUST_STR(__LINE__) ); } #endif #if GENIECLUST_R #include #else #include #endif #if GENIECLUST_R #define GENIECLUST_PRINT(fmt) REprintf((fmt)); #else #define GENIECLUST_PRINT(fmt) fprintf(stderr, (fmt)); #endif #if GENIECLUST_R #define GENIECLUST_PRINT_int(fmt, val) REprintf((fmt), (int)(val)); #else #define GENIECLUST_PRINT_int(fmt, val) fprintf(stderr, (fmt), (int)(val)); #endif #ifndef INFTY #define INFTY (std::numeric_limits::infinity()) #endif #endif genieclust/src/r_compare_partitions.cpp0000644000176200001440000002265114040205131020126 0ustar liggesusers/* Partition Similarity Scores * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #include "c_compare_partitions.h" #include #include #include using namespace Rcpp; /** Extract or compute the contingency matrix based on given arguments * * @param x vector or contingency table (matrix) * @param y R_NilValue or vector of size x.size() if x is not a matrix * @param xc [out] * @param yc [out] * * @return flat, contiguous c_style vector representing the contingency table * with xc rows and yc columns */ std::vector get_contingency_matrix(RObject x, RObject y, ssize_t* xc, ssize_t* yc) { if (Rf_isMatrix(x)) { if (!Rf_isNull(y)) stop("if x is a contingency matrix, y must be NULL"); if (!(Rf_isInteger(x) | Rf_isReal(x))) stop("x must be of type numeric"); IntegerMatrix X(x); *xc = X.nrow(); *yc = X.ncol(); std::vector C((*xc)*(*yc)); ssize_t k=0; for (ssize_t i=0; i<*xc; ++i) for (ssize_t j=0; j<*yc; ++j) C[k++] = X(i, j); // Fortran -> C-style return C; } else { if (Rf_isNull(y)) stop("if x is not a contingency matrix, y must not be NULL"); if (!(Rf_isInteger(x) | Rf_isReal(x) | Rf_isLogical(x) | Rf_isFactor(x))) stop("x must be of type numeric"); if (!(Rf_isInteger(x) | Rf_isReal(x) | Rf_isLogical(x) | Rf_isFactor(x))) stop("y must be of type numeric"); IntegerVector rx(x); IntegerVector ry(y); ssize_t n = rx.size(); if (ry.size() != n) stop("x and y must be of equal lengths"); for (ssize_t i=0; i C((*xc)*(*yc)); Ccontingency_table(C.data(), *xc, *yc, xmin, ymin, INTEGER(SEXP(rx)), INTEGER(SEXP(ry)), n); return C; } } //' @title Pairwise Partition Similarity Scores //' //' @description //' Let \code{x} and \code{y} represent two partitions of a set of \eqn{n} //' elements into, respectively, \eqn{K} and \eqn{L} //' nonempty and pairwise disjoint subsets. 
//' For instance, these can be two clusterings of a dataset with //' \eqn{n} observations specified by two vectors of labels. //' The functions described in this section quantify the similarity between //' \code{x} and \code{y}. They can be used as external cluster //' validity measures, i.e., in the presence of reference (ground-truth) //' partitions. //' //' @details //' Every index except \code{mi_score()} (which computes the mutual //' information score) outputs 1 given two identical partitions. //' Note that partitions are always defined up to a bijection of the set of //' possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) //' represent the same 2-partition. //' //' \code{rand_score()} gives the Rand score (the "probability" of agreement //' between the two partitions) and //' \code{adjusted_rand_score()} is its version corrected for chance, //' see (Hubert, Arabie, 1985), //' its expected value is 0.0 given two independent partitions. //' Due to the adjustment, the resulting index might also be negative //' for some inputs. //' //' Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score //' and \code{adjusted_fm_score()} is its adjusted-for-chance version, //' see (Hubert, Arabie, 1985). //' //' Note that both the (unadjusted) Rand and FM scores are bounded from below //' by \eqn{1/(K+1)} if \eqn{K=L}, hence their adjusted versions are preferred. //' //' \code{mi_score()}, \code{adjusted_mi_score()} and //' \code{normalized_mi_score()} are information-theoretic //' scores, based on mutual information, //' see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}} //' in (Vinh et al., 2010). //' //' \code{normalized_accuracy()} is defined as //' \eqn{(Accuracy(C_\sigma)-1/L)/(1-1/L)}, where \eqn{C_\sigma} is a version //' of the confusion matrix for given \code{x} and \code{y}, //' \eqn{K \leq L}, with columns permuted based on the solution to the //' Maximal Linear Sum Assignment Problem. 
//' \eqn{Accuracy(C_\sigma)} is sometimes referred to as Purity, //' e.g., in (Rendon et al. 2011). //' //' \code{pair_sets_index()} gives the Pair Sets Index (PSI) //' adjusted for chance (Rezaei, Franti, 2016), \eqn{K \leq L}. //' Pairing is based on the solution to the Linear Sum Assignment Problem //' of a transformed version of the confusion matrix. //' //' @references //' Hubert L., Arabie P., Comparing Partitions, //' Journal of Classification 2(1), 1985, 193-218, esp. Eqs. (2) and (4). //' //' Rendon E., Abundez I., Arizmendi A., Quiroz E.M., //' Internal versus external cluster validation indexes, //' International Journal of Computers and Communications 5(1), 2011, 27-34. //' //' Rezaei M., Franti P., Set matching measures for external cluster validity, //' IEEE Transactions on Knowledge and Data Mining 28(8), 2016, 2173-2186. //' //' Vinh N.X., Epps J., Bailey J., //' Information theoretic measures for clusterings comparison: //' Variants, properties, normalization and correction for chance, //' Journal of Machine Learning Research 11, 2010, 2837-2854. //' //' //' @param x an integer vector of length n (or an object coercible to) //' representing a K-partition of an n-set, //' or a confusion matrix with K rows and L columns (see \code{table(x, y)}) //' //' @param y an integer vector of length n (or an object coercible to) //' representing an L-partition of the same set), //' or NULL (if x is an K*L confusion matrix) //' //' @return A single real value giving the similarity score. 
//' //' @examples //' y_true <- iris[[5]] //' y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster //' adjusted_rand_score(y_true, y_pred) //' rand_score(table(y_true, y_pred)) # the same //' adjusted_fm_score(y_true, y_pred) //' fm_score(y_true, y_pred) //' mi_score(y_true, y_pred) //' normalized_mi_score(y_true, y_pred) //' adjusted_mi_score(y_true, y_pred) //' normalized_accuracy(y_true, y_pred) //' pair_sets_index(y_true, y_pred) //' //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double adjusted_rand_score(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_pairs(C.data(), xc, yc).ar; } //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double rand_score(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_pairs(C.data(), xc, yc).r; } //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double adjusted_fm_score(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_pairs(C.data(), xc, yc).afm; } //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double fm_score(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_pairs(C.data(), xc, yc).fm; } //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double mi_score(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_info(C.data(), xc, yc).mi; } //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double normalized_mi_score(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_info(C.data(), xc, yc).nmi; } //' @rdname comparing_partitions //' @export 
//[[Rcpp::export]] double adjusted_mi_score(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_info(C.data(), xc, yc).ami; } //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double normalized_accuracy(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_nacc(C.data(), xc, yc); } //' @rdname comparing_partitions //' @export //[[Rcpp::export]] double pair_sets_index(RObject x, RObject y=R_NilValue) { ssize_t xc, yc; std::vector C( get_contingency_matrix(x, y, &xc, &yc) ); return Ccompare_partitions_psi(C.data(), xc, yc); } genieclust/src/c_disjoint_sets.h0000644000176200001440000000721014002205500016523 0ustar liggesusers/* class CDisjointSets * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_disjoint_sets_h #define __c_disjoint_sets_h #include "c_common.h" #include #include /*! Disjoint Sets (Union-Find) Data Structure * * A class to represent partitions of the set {0,1,...,n-1} for any n. * * Path compression for find() is implemented, * but the union() operation is naive (neither * it is union by rank nor by size), * see https://en.wikipedia.org/wiki/Disjoint-set_data_structure. 
* This is by design, as some other operations in the current * package rely on the assumption that the parent id of each * element is always <= than itself. */ class CDisjointSets { protected: ssize_t n; //!< number of distinct elements ssize_t k; //!< number of subsets std::vector par; /*!< par[i] is the id of the parent * of the i-th element */ public: /*! Starts with a "weak" partition { {0}, {1}, ..., {n-1} }, * i.e., n singletons. * * @param n number of elements, n>=0. */ CDisjointSets(ssize_t n) : par(n) { // if (n < 0) throw std::domain_error("n < 0"); this->n = n; this->k = n; for (ssize_t i=0; ipar[i] = i; } /*! A nullary constructor allows Cython to allocate * the instances on the stack. Do not use otherwise. */ CDisjointSets() : CDisjointSets(0) { } /*! Returns the current number of sets in the partition. */ ssize_t get_k() const { return this->k; } /*! Returns the total cardinality of the set being partitioned. */ ssize_t get_n() const { return this->n; } /*! Finds the subset id for a given x. * * @param x a value in {0,...,n-1} */ ssize_t find(ssize_t x) { if (x < 0 || x >= this->n) throw std::domain_error("x not in [0,n)"); if (this->par[x] != x) { this->par[x] = this->find(this->par[x]); } return this->par[x]; } /*! Merges the sets containing x and y. * * Let px be the parent id of x, and py be the parent id of y. * If px < py, then the new parent id of py will be set to py. * Otherwise, px will have py as its parent. * * If x and y are already members of the same subset, * an exception is thrown. * * @return the id of the parent of x or y, whichever is smaller. 
* * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} */ virtual ssize_t merge(ssize_t x, ssize_t y) { // well, union is a reserved C++ keyword :) x = this->find(x); // includes a range check for x y = this->find(y); // includes a range check for y if (x == y) throw std::invalid_argument("find(x) == find(y)"); if (y < x) std::swap(x, y); this->par[y] = x; this->k -= 1; return x; } }; #endif genieclust/src/c_argfuns.h0000644000176200001440000000704514002205461015323 0ustar liggesusers/* Some sort/search/vector indexing-related functions * missing in the Standard Library, including the ones to: * a. find the (stable) ordering permutation of a vector * b. find the k-th smallest value in a vector * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_argsort_h #define __c_argsort_h #include "c_common.h" #include /*! Comparer for argsort(). * * Assures the resulting permutation is stable. */ template struct __argsort_comparer { const T* x; __argsort_comparer(const T* x) { this->x = x; } bool operator()(ssize_t i, ssize_t j) const { return this->x[i] < this->x[j] || (this->x[i] == this->x[j] && i < j); } }; /*! Finds an(*) ordering permutation w.r.t. \lt. * * Both ret and x should be of the same length n; * ret will be overwritten. * * (*) or THE stable one, if stable=true, which is the default. 
* * We call permutation o stable, whenever i void Cargsort(ssize_t* ret, const T* x, ssize_t n, bool stable=true) { if (n <= 0) throw std::domain_error("n <= 0"); for (ssize_t i=0; i(x)); else std::sort(ret, ret+n, __argsort_comparer(x)); } /*! Returns the index of the (k-1)-th smallest value in an array x. * * argkmin(x, 0) == argmin(x), or, more generally, * argkmin(x, k) == np.argsort(x)[k]. * * Run time: O(nk), where n == len(x). Working mem: O(k). * Does not modify x. * * In practice, very fast for small k and randomly ordered * or almost sorted (increasingly) data. * * * If buf is not NULL, it must be of length at least k+1. * * @param x data * @param n length of x * @param k value in {0,...,n-1}, preferably small * @param buf optional working buffer of size >= k+1, will be overwritten */ template ssize_t Cargkmin(const T* x, ssize_t n, ssize_t k, ssize_t* buf=NULL) { ssize_t* idx; if (n <= 0) throw std::domain_error("n <= 0"); if (k >= n) throw std::domain_error("k >= n"); k += 1; if (!buf) idx = new ssize_t[k]; else idx = buf; for (ssize_t i=0; i 0 && x[i] < x[idx[j-1]]) { idx[j] = idx[j-1]; j -= 1; } idx[j] = i; } for (ssize_t i=k; i 0 && x[i] < x[idx[j-1]]) { idx[j] = idx[j-1]; j -= 1; } idx[j] = i; } ssize_t ret = idx[k-1]; if (!buf) delete [] idx; return ret; } #endif genieclust/src/c_inequity.h0000644000176200001440000000773514002205516015534 0ustar liggesusers/* Inequity (Inequality) Measures * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. 
* You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_inequity_h #define __c_inequity_h #include "c_common.h" #include /*! The Normalised Gini Index * * The normalised Gini index is given by: * $$ * G(x_1,\dots,x_n) = \frac{ * \sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j| * }{ * (n-1) \sum_{i=1}^n x_i * }. * $$ * * Time complexity: $O(n)$ for sorted data; it holds: * $$ * G(x_1,\dots,x_n) = \frac{ * \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)} * }{ * (n-1) \sum_{i=1}^n x_i * }, * $$ * where $\sigma$ is an ordering permutation of $(x_1,\dots,x_n)$. * * * References * ---------- * * Gini C., Variabilita e Mutabilita, Tipografia di Paolo Cuppini, * Bologna, 1912. * * * @param x non-decreasingly sorted c_contiguous input vector >= 0 * @param n length of x * * @return the value of the inequity index, a number in [0,1]. */ template double Cgini_sorted(const T* x, ssize_t n) { double s = 0.0, t = 0.0; GENIECLUST_ASSERT(x[0] >= 0); GENIECLUST_ASSERT(x[n-1] > 0); for (ssize_t i=1; i<=n; ++i) { t += x[n-i]; s += (n-2.0*i+1.0)*x[n-i]; } s = s/(n-1.0)/t; if (s > 1.0) return 1.0; else if (s < 0.0) return 0.0; else return s; } /*! The Normalised Bonferroni Index * * The normalised Bonferroni index is given by: * $$ * B(x_1,\dots,x_n) = \frac{ * \sum_{i=1}^{n} \left( n-\sum_{j=1}^i \frac{n}{n-j+1} \right) * x_{\sigma(n-i+1)} * }{ * (n-1) \sum_{i=1}^n x_i * }, * $$ * where $\sigma$ is an ordering permutation of $(x_1,\dots,x_n)$. * * Time complexity: $O(n)$ for sorted data. * * * References * ---------- * * Bonferroni C., Elementi di Statistica Generale, Libreria Seber, * Firenze, 1930. * * @param x non-decreasingly sorted c_contiguous input vector >= 0 * @param n length of x * * @return the value of the inequity index, a number in [0,1]. 
*/ template double Cbonferroni_sorted(const T* x, ssize_t n) { double s = 0.0, t = 0.0, c = 0.0; GENIECLUST_ASSERT(x[0] >= 0); GENIECLUST_ASSERT(x[n-1] > 0); for (ssize_t i=1; i<=n; ++i) { c += n/(n-i+1.0); t += x[n-i]; s += (n-c)*x[n-i]; } s = s/(n-1.0)/t; if (s > 1.0) return 1.0; else if (s < 0.0) return 0.0; else return s; } // #cpdef np.float64_t coefvar(np.ndarray[T] x, bint is_sorted=False): // #""" // #Coefficient of variation // // #$$ // #C(x_1,\dots,x_n) = \sqrt{\frac{ // #\sum_{i=1}^{n-1} \sum_{j=i+1}^n (x_i-x_j)^2 // #}{ // #(n-1) \sum_{i=1}^n x_i^2 // #}}. // #$$ // // #Is this an inequity measures BTW? // #""" // // ## sorting is not necessary // #cdef unsigned int n = len(x) // #cdef np.float64_t s = 0.0, t = square(x[0]) // #cdef unsigned int i, j // // #for i in range(n-1): // #t += square(x[i+1]) // #for j in range(i+1, n): // #s += square(x[i]-x[j]) // // #return cmath.sqrt(s/(n-1.0)/t) // // // # cpdef np.float64_t vergottini(np.ndarray[T] x, bint is_sorted=False): // # "de Vergottini index // # x <- sort(x, decreasing=TRUE) // # n <- length(x) // # vmax <- sum(1/(2:n)) // # (sum(sapply(1:length(x), function(i) mean(x[1:i])))/sum(x)-1)/vmax // # } #endif genieclust/src/Makevars0000644000176200001440000000015313672011632014677 0ustar liggesusersCXX_STD = CXX11 PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DGENIECLUST_R PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) genieclust/src/c_matrix.h0000644000176200001440000000676114002205524015166 0ustar liggesusers/* Lightweight matrix class - KISS * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_matrix_h #define __c_matrix_h #include /** * Represents a matrix as a C-contiguous array, * i.e., in a row-major order. */ template class matrix { private: size_t n, d; std::vector elems; public: /** Initialises a new matrix of size _nrow*_ncol, filled with 0s * * @param _nrow * @param _ncol */ matrix(size_t _nrow, size_t _ncol) : n(_nrow), d(_ncol), elems(_nrow*_ncol) { ; } /** Initialises a new matrix of size _nrow*_ncol, filled with _ts * * @param _nrow * @param _ncol * @param _t */ matrix(size_t _nrow, size_t _ncol, T _t) : n(_nrow), d(_ncol), elems(_nrow*_ncol, _t) { ; } /** Initialises a new matrix of size _nrow*_ncol based on a contiguous * C- or Fortran-style array * * @param _data * @param _nrow * @param _ncol * @param _c_order whether the first _ncol elements in _data constitute the first row * or the first _nrow elements define the first column */ template matrix(const S* _data, size_t _nrow, size_t _ncol, bool _c_order) : n(_nrow), d(_ncol), elems(_nrow*_ncol) { if (_c_order) { for (size_t i=0; i<_nrow*_ncol; ++i) elems[i] = (T)(_data[i]); } else { size_t k = 0; for (size_t i=0; i<_nrow; i++) { for (size_t j=0; j<_ncol; j++) { elems[k++] = (T)_data[i+_nrow*j]; } } } } /** Read/write access to an element in the i-th row and the j-th column * * @param i * @param j * @return a reference to the indicated matrix element */ T& operator()(const size_t i, const size_t j) { return elems[d*i + j]; } const T& operator()(const size_t i, const size_t j) const { return elems[d*i + j]; } /** Returns a direct pointer to the underlying C-contiguous data array: * the first ncol elements give the 1st row, * the next ncol element give the 2nd row, * and so forth. 
* * @return pointer */ T* data() { return elems.data(); } const T* data() const { return elems.data(); } /** Returns a direct pointer to the start of the i-th row * * @param i * @return pointer */ T* row(const size_t i) { return elems.data()+i*d; } const T* row(const size_t i) const { return elems.data()+i*d; } /** Returns the number of rows * * @return */ size_t nrow() const { return n; } /** Returns the number of columns * * @return */ size_t ncol() const { return d; } }; #endif genieclust/src/c_genie.h0000644000176200001440000010121014002205507014733 0ustar liggesusers/* The Genie++ Clustering Algorithm * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_genie_h #define __c_genie_h #include "c_common.h" #include #include #include #include #include "c_gini_disjoint_sets.h" #include "c_int_dict.h" #include "c_preprocess.h" /*! Base class for CGenie and CGIc */ template class CGenieBase { protected: /*! Stores the clustering result as obtained by * CGenie::apply_genie() or CGIc::apply_gic() */ struct CGenieResult { CGiniDisjointSets ds; /*!< ds at the last iteration, it; * use denoise_index to obtain the final partition */ std::vector links; // deg; // denoise_index; // denoise_index_rev; //!< reverse look-up for denoise_index CCountDisjointSets forest_components; CGenieResult results; /*! When the Genie correction is on, some MST edges will be chosen * in a non-consecutive order. 
An array-based skiplist will speed up * searching within the to-be-consumed edges. Also, if there are * noise points, then the skiplist allows the algorithm * to naturally ignore edges that connect the leaves. */ void mst_skiplist_init(CIntDict* mst_skiplist) { // start with a list that skips all edges that lead to noise points mst_skiplist->clear(); for (ssize_t i=0; in-1; ++i) { ssize_t i1 = this->mst_i[i*2+0]; ssize_t i2 = this->mst_i[i*2+1]; GENIECLUST_ASSERT(i1 < this->n) GENIECLUST_ASSERT(i2 < this->n) if (i1 < 0 || i2 < 0) { continue; // a no-edge -> ignore } if (!this->noise_leaves || (this->deg[i1]>1 && this->deg[i2]>1)) { (*mst_skiplist)[i] = i; /*only the key is important, not the value*/ } } } /** internal, used by get_labels(n_clusters, res) */ ssize_t get_labels(CGiniDisjointSets* ds, ssize_t* res) { std::vector res_cluster_id(n, -1); ssize_t c = 0; for (ssize_t i=0; idenoise_index_rev[i] >= 0) { // a non-noise point ssize_t j = this->denoise_index[ ds->find(this->denoise_index_rev[i]) ]; if (res_cluster_id[j] < 0) { // new cluster res_cluster_id[j] = c; ++c; } res[i] = res_cluster_id[j]; } else { // a noise point res[i] = -1; } } return c; } public: CGenieBase(T* mst_d, ssize_t* mst_i, ssize_t n, bool noise_leaves) : deg(n), denoise_index(n), denoise_index_rev(n) { this->mst_d = mst_d; this->mst_i = mst_i; this->n = n; this->noise_leaves = noise_leaves; // ssize_t missing_mst_edges = 0; for (ssize_t i=0; ideg: Cget_graph_node_degrees(mst_i, n-1, n, this->deg.data()); // Create the non-noise points' translation table (for GiniDisjointSets) // and count the number of noise points if (noise_leaves) { noise_count = 0; ssize_t j = 0; for (ssize_t i=0; i= 2); GENIECLUST_ASSERT(j + noise_count == n); } else { // there are no noise points this->noise_count = 0; for (ssize_t i=0; in - this->noise_count); for (ssize_t i=0; in-1; ++i) { ssize_t i1 = this->mst_i[i*2+0]; ssize_t i2 = this->mst_i[i*2+1]; GENIECLUST_ASSERT(i1 < this->n) GENIECLUST_ASSERT(i2 < 
this->n) if (i1 < 0 || i2 < 0) { continue; // a no-edge -> ignore } if (!this->noise_leaves || (this->deg[i1]>1 && this->deg[i2]>1)) { forest_components.merge(this->denoise_index_rev[i1], this->denoise_index_rev[i2]); } } } /*! There can be at most n-noise_count singleton clusters * in the hierarchy. */ ssize_t get_max_n_clusters() const { return this->n - this->noise_count; } /*! Propagate res with clustering results. * * Noise points get cluster id of -1. * * @param n_clusters maximal number of clusters to find * @param res [out] c_contiguous array of length n * * @return number of clusters detected (not including the noise cluster; * can be less than n_clusters) */ ssize_t get_labels(ssize_t n_clusters, ssize_t* res) { if (this->results.ds.get_n() <= 0) throw std::runtime_error("Apply the clustering procedure first."); if (n_clusters <= this->results.n_clusters) { // use this->results.ds -- from the final iteration return this->get_labels(&(this->results.ds), res); } else { CGiniDisjointSets ds(this->get_max_n_clusters()); for (ssize_t it=0; itget_max_n_clusters() - n_clusters; ++it) { ssize_t j = (this->results.links[it]); if (j < 0) break; // remaining are no-edges ssize_t i1 = this->mst_i[2*j+0]; ssize_t i2 = this->mst_i[2*j+1]; GENIECLUST_ASSERT(i1 >= 0) GENIECLUST_ASSERT(i2 >= 0) ds.merge(this->denoise_index_rev[i1], this->denoise_index_rev[i2]); } return this->get_labels(&ds, res); } } /*! Propagate res with clustering results - * all partitions from cardinality n_clusters to 1. * * Noise points get cluster id of -1. * * @param n_clusters maximal number of clusters to find * @param res [out] c_contiguous matrix of shape (n_clusters+1, n) */ void get_labels_matrix(ssize_t n_clusters, ssize_t* res) { if (this->get_max_n_clusters() < n_clusters) { // there is nothing to do, no merge needed. 
throw std::runtime_error("The requested number of clusters \ is too large with this many detected noise points"); } if (this->results.ds.get_n() <= 0) throw std::runtime_error("Apply the clustering procedure first."); if (n_clusters < this->forest_components.get_k()) { n_clusters = this->forest_components.get_k(); } CGiniDisjointSets ds(this->get_max_n_clusters()); // you can do up to this->get_max_n_clusters() - 1 merges ssize_t cur_cluster = n_clusters; if (this->get_max_n_clusters() == n_clusters) { cur_cluster--; GENIECLUST_ASSERT(cur_cluster >= 0) this->get_labels(&ds, &res[cur_cluster * this->n]); } for (ssize_t it=0; itget_max_n_clusters() - 1; ++it) { ssize_t j = (this->results.links[it]); if (j >= 0) { // might not be true if forest_components.get_k() > 1 ssize_t i1 = this->mst_i[2*j+0]; ssize_t i2 = this->mst_i[2*j+1]; GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0) ds.merge(this->denoise_index_rev[i1], this->denoise_index_rev[i2]); } if (it >= this->get_max_n_clusters() - n_clusters - 1) { cur_cluster--; GENIECLUST_ASSERT(cur_cluster >= 0) this->get_labels(&ds, &res[cur_cluster * this->n]); } } GENIECLUST_ASSERT(cur_cluster == 0) } /*! Propagate res with clustering results - * based on the current this->results.links. * * If there are noise points, not all elements in res will be set. * * @param res [out] c_contiguous array of length n-1, * res[i] gives the index of the MST edge merged at the i-th iteration. * * @return number of items in res set (the array is padded with -1s) */ ssize_t get_links(ssize_t* res) { if (this->results.ds.get_n() <= 0) throw std::runtime_error("Apply the clustering procedure first."); for (ssize_t i=0; iresults.it; ++i) { res[i] = this->results.links[i]; } for (ssize_t i=this->results.it; in-1; ++i) { res[i] = -1; } return this->results.it; } /*! Set res[i] to true if the i-th point is a noise one. 
* * Makes sense only if noise_leaves==true * * @param res [out] array of length n */ void get_noise_status(bool* res) const { for (ssize_t i=0; inoise_leaves && this->deg[i] <= 1); } } }; /*! The Genie++ Hierarchical Clustering Algorithm * * The Genie algorithm (Gagolewski et al., 2016) links two clusters * in such a way that a chosen economic inequity measure * (here, the Gini index) of the cluster sizes does not increase drastically * above a given threshold. The method most often outperforms * the Ward or average linkage, k-means, spectral clustering, * DBSCAN, Birch and others in terms of the clustering * quality on benchmark data while retaining the speed of the single * linkage algorithm. * * This is a re-implementation of the original (Gagolewski et al., 2016) * algorithm. New features include: * 1. Given a pre-computed minimum spanning tree (MST), * it only requires amortised O(n sqrt(n))-time. * 2. MST leaves can be * marked as noise points (if `noise_leaves==True`). This is useful, * if the Genie algorithm is applied on the MST with respect to * the HDBSCAN-like mutual reachability distance. * 3. (option) During merge, first pair of clusters that would * give a decrease of the Gini index below the threshold is chosen * (or the one that gives the smallest Gini index if that's not possible) * -- turns out to be slower and (TODO: testing required). * 4. The MST need not be connected (is a spanning forest) (e.g., if it * computed based on a disconnected k-NN graph) - each component * will never be merged with any other one. * * * * References * =========== * * Gagolewski M., Bartoszuk M., Cena A., * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, * Information Sciences 363, 2016, pp. 8-23. 
doi:10.1016/j.ins.2016.05.003 */ template class CGenie : public CGenieBase { protected: bool new_merge; //* mst_skiplist, ssize_t n_clusters, double gini_threshold, std::vector* links) { if (n_clusters > this->get_max_n_clusters()) { // there is nothing to do, no merge needed. throw std::runtime_error("The requested number of clusters \ is too large with this many detected noise points"); } if (n_clusters < this->forest_components.get_k()) { n_clusters = this->forest_components.get_k(); // throw std::runtime_error("The requested number of clusters // is too small as the MST is not connected"); } // mst_skiplist contains all mst_i edge indexes // that we need to consider, and nothing more. GENIECLUST_ASSERT(!mst_skiplist->empty()); ssize_t lastidx = mst_skiplist->get_key_min(); ssize_t lastm = 0; // last minimal cluster size ssize_t it = 0; while (!mst_skiplist->empty() && ds->get_k() > n_clusters) { // determine the pair of vertices to merge ssize_t i1; ssize_t i2; if (ds->get_gini() > gini_threshold) { // the Genie correction for inequity of cluster sizes ssize_t m = ds->get_smallest_count(); if (m != lastm || lastidx < mst_skiplist->get_key_min()) { // need to start from the beginning of the MST skiplist lastidx = mst_skiplist->get_key_min(); } // else reuse lastidx GENIECLUST_ASSERT(lastidx < this->n - 1) GENIECLUST_ASSERT(lastidx >= 0 && lastidx < this->n - 1); GENIECLUST_ASSERT(this->mst_i[2*lastidx+0] >= 0 && this->mst_i[2*lastidx+1] >= 0); // find the MST edge connecting a cluster of the smallest size // with another one while (ds->get_count(this->denoise_index_rev[this->mst_i[2*lastidx+0]]) != m && ds->get_count(this->denoise_index_rev[this->mst_i[2*lastidx+1]]) != m) { lastidx = mst_skiplist->get_key_next(lastidx); GENIECLUST_ASSERT(lastidx >= 0 && lastidx < this->n - 1); GENIECLUST_ASSERT(this->mst_i[2*lastidx+0] >= 0 && this->mst_i[2*lastidx+1] >= 0); } i1 = this->mst_i[2*lastidx+0]; i2 = this->mst_i[2*lastidx+1]; (*links)[it] = lastidx; ssize_t delme = 
lastidx; lastidx = mst_skiplist->get_key_next(lastidx); mst_skiplist->erase(delme); // O(1) lastm = m; } else { // single linkage-like // note that we consume the MST edges in an non-decreasing order w.r.t. weights ssize_t curidx = mst_skiplist->pop_key_min(); GENIECLUST_ASSERT(curidx >= 0 && curidx < this->n - 1); i1 = this->mst_i[2*curidx+0]; i2 = this->mst_i[2*curidx+1]; (*links)[it] = curidx; } GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0) ssize_t i1r = this->denoise_index_rev[i1]; ssize_t i2r = this->denoise_index_rev[i2]; bool forget = this->forest_components.get_k() > 1 && this->forest_components.find(i1r) == this->forest_components.find(i2r) && this->forest_components.get_count(i1r) == ds->get_count(i1r) + ds->get_count(i2r); if (forget) { ds->merge_and_forget(i1r, i2r); } else { ds->merge(i1r, i2r); } it++; } return it; // number of merges performed } /*! Run the Genie+++ partitioning -- merge a pair of sets * that reduces the Gini index below the threshold (provided that is possible) * * **EXPERIMENTAL** This is slower and (perhaps - thorough testing required) * not that awesome. * * @param ds * @param mst_skiplist * @param n_clusters maximal number of clusters to detect * @param gini_threshold * @param links [out] c_contiguous array of size (n-1), * links[iter] = index of merged mst_i (up to the number of performed * merges, see retval). * * @return The number of performed merges. */ ssize_t do_genie_new(CGiniDisjointSets* ds, CIntDict* mst_skiplist, ssize_t n_clusters, double gini_threshold, std::vector* links) { if (n_clusters > this->get_max_n_clusters()) { // there is nothing to do, no merge needed. 
throw std::runtime_error("The requested number of clusters \ is too large with this many detected noise points"); } if (n_clusters < this->forest_components.get_k()) { n_clusters = this->forest_components.get_k(); // throw std::runtime_error("The requested number of clusters // is too small as the MST is not connected"); } // mst_skiplist contains all mst_i edge indexes // that we need to consider, and nothing more. GENIECLUST_ASSERT(!mst_skiplist->empty()); ssize_t it = 0; while (!mst_skiplist->empty() && ds->get_k() > n_clusters) { // determine the pair of vertices to merge ssize_t last_idx = mst_skiplist->get_key_min(); double best_gini = 1.0; ssize_t best_idx = last_idx; while (1) { ssize_t i1 = this->mst_i[2*last_idx+0]; ssize_t i2 = this->mst_i[2*last_idx+1]; ssize_t i1r = this->denoise_index_rev[i1]; ssize_t i2r = this->denoise_index_rev[i2]; bool forget = this->forest_components.get_k() > 1 && this->forest_components.find(i1r) == this->forest_components.find(i2r) && this->forest_components.get_count(i1r) == ds->get_count(i1r) + ds->get_count(i2r); double test_gini = ds->test_gini_after_merge(i1r, i2r, forget); if (test_gini < best_gini) { best_gini = test_gini; best_idx = last_idx; } // printf(" %ld-%ld %.3lf %.3lf\n", i1r, i2r, test_gini, gini_threshold); if (best_gini <= gini_threshold) break; if (last_idx == mst_skiplist->get_key_max()) break; last_idx = mst_skiplist->get_key_next(last_idx); } ssize_t i1 = this->mst_i[2*best_idx+0]; ssize_t i2 = this->mst_i[2*best_idx+1]; ssize_t i1r = this->denoise_index_rev[i1]; ssize_t i2r = this->denoise_index_rev[i2]; bool forget = this->forest_components.get_k() > 1 && this->forest_components.find(i1r) == this->forest_components.find(i2r) && this->forest_components.get_count(i1r) == ds->get_count(i1r) + ds->get_count(i2r); (*links)[it] = best_idx; mst_skiplist->erase(best_idx); // O(1) if (forget) ds->merge_and_forget(i1r, i2r); else ds->merge(i1r, i2r); // printf("%ld-%ld %.3lf\n", i1r, i2r, ds->get_gini()); it++; 
} return it; // number of merges performed } public: CGenie(T* mst_d, ssize_t* mst_i, ssize_t n, bool noise_leaves=false, bool new_merge=false) : CGenieBase(mst_d, mst_i, n, noise_leaves), new_merge(new_merge) { ; } CGenie() : CGenie(NULL, NULL, 0, false) { } /*! Run the Genie++ algorithm * * @param n_clusters number of clusters to find, 1 for the complete hierarchy * (warning: the algorithm might stop early if there are many noise points * or the number of clusters to detect is > 1). * @param gini_threshold the Gini index threshold */ void apply_genie(ssize_t n_clusters, double gini_threshold) { if (n_clusters < 1) throw std::domain_error("n_clusters must be >= 1"); this->results = typename CGenieBase::CGenieResult(this->n, this->noise_count, n_clusters); CIntDict mst_skiplist(this->n - 1); this->mst_skiplist_init(&mst_skiplist); if (new_merge) { this->results.it = this->do_genie_new(&(this->results.ds), &mst_skiplist, n_clusters, gini_threshold, &(this->results.links)); } else { this->results.it = this->do_genie(&(this->results.ds), &mst_skiplist, n_clusters, gini_threshold, &(this->results.links)); } } }; /*! GIc (Genie+Information Criterion) Hierarchical Clustering Algorithm * * GIc has been proposed by Anna Cena in [1] and was inspired * by Mueller's (et al.) ITM [2] and Gagolewski's (et al.) Genie [3] * * References * ========== * * [1] Cena A., Adaptive hierarchical clustering algorithms based on * data aggregation methods, PhD Thesis, Systems Research Institute, * Polish Academy of Sciences 2018. * * [2] Mueller A., Nowozin S., Lampert C.H., Information Theoretic * Clustering using Minimum Spanning Trees, DAGM-OAGM 2012. * * [3] Gagolewski M., Bartoszuk M., Cena A., * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, * Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003 */ template class CGIc : public CGenie { protected: /*! 
Run the Genie++ algorithm with different thresholds for the Gini index * and determine the intersection of all the resulting * n_clusters-partitions; for this, we need the union of the * set of MST edges that were left "unmerged". * * @param n_clusters number of clusters to look for in Genie run * @param gini_thresholds array of floats in [0,1] * @param n_thresholds size of gini_thresholds * * @return indexes of MST edges that were left unused by at least * one Genie algorithm run; this gives the intersection of partitions. * The resulting list will contain a sentinel, this->n - 1. * * If n_thresholds is 0 or the requested n_clusters is too large, * all non-noise edges are set as unused. */ std::vector get_intersection_of_genies(ssize_t n_clusters, double* gini_thresholds, ssize_t n_thresholds) { std::vector unused_edges; if (n_thresholds == 0 || n_clusters >= this->get_max_n_clusters()) { // all edges unused -> will start from n singletons for (ssize_t i=0; i < this->n - 1; ++i) { ssize_t i1 = this->mst_i[2*i+0]; ssize_t i2 = this->mst_i[2*i+1]; if (i1 < 0 || i2 < 0) continue; // a no-edge -> ignore if (!this->noise_leaves || (this->deg[i1] > 1 && this->deg[i2] > 1)) unused_edges.push_back(i); } unused_edges.push_back(this->n - 1); // sentinel return unused_edges; // EOF. 
} else { // the same initial skiplist is used in each iter: CIntDict mst_skiplist_template(this->n-1); this->mst_skiplist_init(&mst_skiplist_template); for (ssize_t i=0; iget_max_n_clusters()); std::vector links(this->n - 1, -1); // the history of edge merges CIntDict mst_skiplist(mst_skiplist_template); this->do_genie(&ds, &mst_skiplist, n_clusters, gini_threshold, &links); // start where do_genie() concluded; add all remaining MST edges // to the list of unused_edges while (!mst_skiplist.empty()) unused_edges.push_back(mst_skiplist.pop_key_min()); } // let unused_edges = sort(unique(unused_edges)) unused_edges.push_back(this->n - 1); // sentinel std::sort(unused_edges.begin(), unused_edges.end()); // sorted, but some might not be unique, so let's remove dups ssize_t k = 0; for (ssize_t i=1; i<(ssize_t)unused_edges.size(); ++i) { if (unused_edges[i] != unused_edges[k]) { k++; unused_edges[k] = unused_edges[i]; } } unused_edges.resize(k+1); GENIECLUST_ASSERT(unused_edges[k] == this->n - 1); return unused_edges; } } public: CGIc(T* mst_d, ssize_t* mst_i, ssize_t n, bool noise_leaves=false) : CGenie(mst_d, mst_i, n, noise_leaves) { if (this->forest_components.get_k() > 1) throw std::domain_error("MST is not connected; this is not (yet) supported"); } CGIc() : CGIc(NULL, NULL, 0, false) { } /*! 
Run the GIc (Genie+Information Criterion) algorithm * * @param n_clusters maximal number of clusters to find, * 1 for the complete hierarchy (if possible) * @param add_clusters number of additional clusters to work * with internally * @param n_features number of features (can be fractional) * @param gini_thresholds array of size n_thresholds * @param n_thresholds size of gini_thresholds */ void apply_gic(ssize_t n_clusters, ssize_t add_clusters, double n_features, double* gini_thresholds, ssize_t n_thresholds) { if (n_clusters < 1) throw std::domain_error("n_clusters must be >= 1"); GENIECLUST_ASSERT(add_clusters>=0); GENIECLUST_ASSERT(n_thresholds>=0); std::vector unused_edges = get_intersection_of_genies( n_clusters+add_clusters, gini_thresholds, n_thresholds ); // note that the unused_edges list: // 1. does not include noise edges; // 2. is sorted (strictly) increasingly // 3. contains a sentinel element at the end == n-1 this->results = typename CGenieBase::CGenieResult(this->n, this->noise_count, n_clusters); // Step 1. Merge all used edges (used by all the Genies) // There are of course many possible merge orders that we could consider // here. We will rely on the current ordering of the MST edges, // which is wrt non-decreasing mst_d. 
ssize_t cur_unused_edges = 0; ssize_t num_unused_edges = unused_edges.size()-1; // ignore sentinel std::vector cluster_sizes(this->get_max_n_clusters(), 1); std::vector cluster_d_sums(this->get_max_n_clusters(), (T)0.0); this->results.it = 0; for (ssize_t i=0; in - 1; ++i) { GENIECLUST_ASSERT(i<=unused_edges[cur_unused_edges]); if (unused_edges[cur_unused_edges] == i) { // ignore current edge and advance to the next unused edge cur_unused_edges++; continue; } ssize_t i1 = this->mst_i[2*i+0]; ssize_t i2 = this->mst_i[2*i+1]; if (i1 < 0 || i2 < 0) continue; // a no-edge -> ignore if (!this->noise_leaves || (this->deg[i1] > 1 && this->deg[i2] > 1)) { GENIECLUST_ASSERT(this->results.it < this->n-1); this->results.links[this->results.it++] = i; i1 = this->results.ds.find(this->denoise_index_rev[i1]); i2 = this->results.ds.find(this->denoise_index_rev[i2]); if (i1 > i2) std::swap(i1, i2); this->results.ds.merge(i1, i2); // new parent node is i1 cluster_sizes[i1] += cluster_sizes[i2]; cluster_d_sums[i1] += cluster_d_sums[i2] + this->mst_d[i]; cluster_sizes[i2] = 0; cluster_d_sums[i2] = INFTY; } } GENIECLUST_ASSERT(cur_unused_edges == num_unused_edges); // sentinel GENIECLUST_ASSERT(unused_edges[num_unused_edges] == this->n-1); // sentinel GENIECLUST_ASSERT(num_unused_edges+1 == this->results.ds.get_k()); // Step 2. 
Merge all used edges /* The objective function - Information Criterion - to MAXIMISE is sum_{i in ds.parents()} -cluster_sizes[i] * ( n_features * log cluster_sizes[i] -(n_features-1) * log cluster_d_sums[i] ) */ while (num_unused_edges > 0 && this->results.itget_max_n_clusters() - n_clusters) { ssize_t max_which = -1; double max_obj = -INFTY; for (ssize_t j=0; jmst_i[2*i+0]; ssize_t i2 = this->mst_i[2*i+1]; GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0); i1 = this->results.ds.find(this->denoise_index_rev[i1]); i2 = this->results.ds.find(this->denoise_index_rev[i2]); if (i1 > i2) std::swap(i1, i2); GENIECLUST_ASSERT(i1 != i2); // singletons should be merged first // (we assume that they have cluster_d_sums==Inf // (this was not addressed in Mueller's in his paper) if (cluster_d_sums[i1] < 1e-12 || cluster_d_sums[i2] < 1e-12) { max_which = j; break; } double cur_obj = -(cluster_sizes[i1]+cluster_sizes[i2])*( n_features*std::log((double)cluster_d_sums[i1]+cluster_d_sums[i2]+this->mst_d[i]) -(n_features-1.0)*std::log((double)cluster_sizes[i1]+cluster_sizes[i2]) ); cur_obj += cluster_sizes[i1]*( n_features*std::log((double)cluster_d_sums[i1]) -(n_features-1.0)*std::log((double)cluster_sizes[i1]) ); cur_obj += cluster_sizes[i2]*( n_features*std::log((double)cluster_d_sums[i2]) -(n_features-1.0)*std::log((double)cluster_sizes[i2]) ); GENIECLUST_ASSERT(std::isfinite(cur_obj)); if (cur_obj > max_obj) { max_obj = cur_obj; max_which = j; } } GENIECLUST_ASSERT(max_which >= 0 && max_which < num_unused_edges); ssize_t i = unused_edges[max_which]; GENIECLUST_ASSERT(this->results.it < this->n - 1); this->results.links[this->results.it++] = i; ssize_t i1 = this->mst_i[2*i+0]; ssize_t i2 = this->mst_i[2*i+1]; GENIECLUST_ASSERT(i1 >= 0 && i2 >= 0); i1 = this->results.ds.find(this->denoise_index_rev[i1]); i2 = this->results.ds.find(this->denoise_index_rev[i2]); if (i1 > i2) std::swap(i1, i2); this->results.ds.merge(i1, i2); // new parent node is i1 cluster_sizes[i1] += cluster_sizes[i2]; 
cluster_d_sums[i1] += cluster_d_sums[i2]+this->mst_d[i]; cluster_sizes[i2] = 0; cluster_d_sums[i2] = INFTY; unused_edges[max_which] = unused_edges[num_unused_edges-1]; num_unused_edges--; } } }; #endif genieclust/src/c_int_dict.h0000644000176200001440000002467014002205521015453 0ustar liggesusers/* class CIntDict * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_int_dict_h #define __c_int_dict_h #include "c_common.h" #include #include #include /*! ordered_map (dictionary) for keys in {0,1,...,n-1} (small ints). * Elements are stored in the natural 0 <= 1 <= ... <= n-1 order. * * Most supported operations (except for inserting a new key * in the "middle") are as fast as you can get. * Yet, everything comes at a price: here it's the O(n) memory * requirement, even if data are sparse. * * Use case: GiniDisjointSets in the `genieclust` Python package. */ template class CIntDict { protected: ssize_t n; //!< total number of distinct keys possible ssize_t k; //!< number of keys currently stored std::vector tab; //!< tab[i] is the element associated with key i std::vector tab_next; //!< an array-based... std::vector tab_prev; //!< ...doubly-linked list... ssize_t tab_head; //!< ...for quickly accessing and iterating over... ssize_t tab_tail; //!< ...this->tab data public: /*! Constructs an empty container. * * @param n number of elements, n>=0. 
*/ CIntDict(ssize_t n) : tab(n), tab_next(n, n), tab_prev(n, -1) { // if (n < 0) throw std::domain_error("n < 0"); this->n = n; this->k = 0; this->tab_head = n; this->tab_tail = -1; } /*! Constructs a full-size container. * * @param n number of elements, n>=0. * @param val value to replicate at each position */ CIntDict(ssize_t n, const T& val) : tab(n), tab_next(n), tab_prev(n) { // if (n < 0) throw std::domain_error("n < 0"); this->n = n; this->k = n; for (ssize_t i=0; itab_prev[i] = i-1; this->tab_next[i] = i+1; this->tab = val; } this->tab_head = 0; this->tab_tail = n-1; } /*! A nullary constructor allows Cython to allocate * the instances on the stack. Do not use otherwise. */ CIntDict() : CIntDict(0) { } /*! Returns the current number of elements in the container. * * Time complexity: O(1) */ inline ssize_t size() const { return this->k; } /*! Returns the maximum number of elements that the container can hold. */ inline ssize_t max_size() const { return this->n; } /*! Tests whether the container is empty, i.e., its size() is 0. */ inline bool empty() const { return this->k == 0; } /*! Counts the number of elements with given key, i.e., returns 0 or 1 * depending on whether an element with key i exists. * * Time complexity: O(1) * * @param i key in [0,n) */ inline size_t count(ssize_t i) const { if (i < 0 || i >= n) throw std::out_of_range("CIntDict::count key out of range"); return (tab_prev[i]>=0 || tab_next[i] new head and tail tab_head = tab_tail = i; } else if (i < tab_head) { // new head tab_next[i] = tab_head; GENIECLUST_ASSERT(tab_prev[i] == -1); tab_prev[tab_head] = i; tab_head = i; } else if (i > tab_tail) { // new tail tab_next[tab_tail] = i; tab_prev[i] = tab_tail; GENIECLUST_ASSERT(tab_next[i] == n); tab_tail = i; } else { // insert in the "middle" // slow op // TODO skip list, etc. ?? 
ssize_t elem_before_i = tab_head; while (tab_next[elem_before_i] < i) elem_before_i = tab_next[elem_before_i]; ssize_t elem_after_i = tab_next[elem_before_i]; GENIECLUST_ASSERT(tab_prev[elem_after_i] == elem_before_i); tab_next[i] = elem_after_i; tab_prev[i] = elem_before_i; tab_next[elem_before_i] = i; tab_prev[elem_after_i] = i; } k++; // we have a brand new elem in the storage } return tab[i]; } /*! Removes a single element, provided it exists. * * Time complexity: O(1) * * @param i key in [0,n) * @return the number of elements removed (0 or 1) */ ssize_t erase(ssize_t i) { if (!count(i)) return 0; if (i == tab_head && i == tab_tail) { // that was the last (size-wise) element in the container tab_head = n; tab_tail = -1; } else if (i == tab_head) { // that was the least element tab_head = tab_next[tab_head]; tab_prev[tab_head] = -1; } else if (i == tab_tail) { // that was the largest one tab_tail = tab_prev[tab_tail]; tab_next[tab_tail] = n; } else { // elem in the "middle" ssize_t elem_after_i = tab_next[i]; ssize_t elem_before_i = tab_prev[i]; tab_next[elem_before_i] = elem_after_i; tab_prev[elem_after_i] = elem_before_i; } tab[i] = T(); // force destructor call tab_prev[i] = -1; tab_next[i] = n; k--; return 1; // one element has been removed } ssize_t get_key_min() const { return tab_head; } ssize_t get_key_max() const { return tab_tail; } ssize_t get_key_next(ssize_t i) const { return tab_next[i]; } ssize_t get_key_prev(ssize_t i) const { return tab_prev[i]; } ssize_t pop_key_min() { ssize_t ret = tab_head; erase(ret); return ret; } ssize_t pop_key_max() { ssize_t ret = tab_tail; erase(ret); return ret; } // ------- minimal iterator-based interface ----------------------------- /*! 
If you want more than merely an input_iterator,
     *  go ahead, implement it and make a pull request :)
     */
    class iterator : public std::iterator<std::input_iterator_tag, ssize_t> {
    private:
        const ssize_t* tab_next;
        ssize_t cur;
    public:
        iterator(ssize_t tab_head, ssize_t* tab_next)
            : tab_next(tab_next), cur(tab_head) { }

        iterator& operator++() { cur = tab_next[cur]; return *this; }

        iterator operator++(int) {
            iterator tmp(*this); operator++(); return tmp;
        }

        bool operator==(const iterator& rhs) const {
            return tab_next==rhs.tab_next && cur==rhs.cur;
        }

        bool operator!=(const iterator& rhs) const {
            return tab_next!=rhs.tab_next || cur!=rhs.cur;
        }

        ssize_t operator*() const { return cur; }
    };

    /*! Returns an iterator pointing to the element in the container
     *  that has the least key */
    iterator begin() { return iterator(tab_head, tab_next.data()); }

    /*! Returns an iterator pointing to the past-the-end element */
    iterator end() { return iterator(n, tab_next.data()); }

    // TODO /go ahead, write it, make a pull request/
    // Returns an iterator pointing to the element in the container
    // that has the greatest key
    // reverse_iterator rbegin()

    // TODO /go ahead, write it, make a pull request/
    // Returns an iterator pointing to the past-the-beginning element
    // reverse_iterator rend()

    // TODO /go ahead, write it, make a pull request/
    // cbegin, cend, crbegin, crend()

    // TODO /go ahead, write it, make a pull request/
    // Returns an iterator to an element with given key,
    // or returns an iterator to end() if it does not exist
    // iterator find ( const key_type& k );
    // const_iterator find ( const key_type& k ) const;

    // TODO /go ahead, write it, make a pull request/
    // Removes a single element from the container
    // iterator erase ( const_iterator position );  // by position (1)
};

#endif
genieclust/src/Makevars.win0000644000176200001440000000015313672011637015500 0ustar liggesusers
CXX_STD = CXX11
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DGENIECLUST_R
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)
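The array-based doubly-linked list behind `CIntDict` can be exercised in isolation. Below is a minimal, self-contained sketch of the same idea — it is not the genieclust class itself; the `SmallIntSet` name and the simplified append-only insertion are ours. It stores keys from {0,...,n-1} and offers O(1) erase, minimum, and successor queries, at the price of O(n) memory even for sparse data:

```cpp
#include <cassert>
#include <vector>

// Ordered set of keys in {0,...,n-1}: O(1) erase/min/successor,
// O(n) memory -- the same trade-off CIntDict makes.
class SmallIntSet {
    long n;
    std::vector<long> nxt, prv;  // array-based doubly-linked list
    long head, tail;
public:
    explicit SmallIntSet(long n)
        : n(n), nxt(n, n), prv(n, -1), head(n), tail(-1) { }

    bool contains(long i) const {
        return prv[i] >= 0 || nxt[i] < n || head == i;
    }

    // Insert, assuming keys arrive in increasing order (append at the tail);
    // CIntDict additionally supports the (slower) insertion in the "middle".
    void append(long i) {
        if (tail < 0) { head = tail = i; return; }
        nxt[tail] = i; prv[i] = tail; tail = i;
    }

    void erase(long i) {  // O(1): unlink i from the list
        long a = prv[i], b = nxt[i];
        if (a >= 0) nxt[a] = b; else head = b;
        if (b < n)  prv[b] = a; else tail = a;
        prv[i] = -1; nxt[i] = n;
    }

    long key_min() const { return head; }           // == n if empty
    long key_next(long i) const { return nxt[i]; }  // == n past the end
};
```

For instance, after appending the keys 1, 3, 4, 8 and erasing 3 in O(1), the successor of 1 becomes 4 — exactly the skiplist-style behaviour the Genie loop relies on when it drops consumed MST edges.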
genieclust/src/c_gini_disjoint_sets.h0000644000176200001440000002650514002205512017544 0ustar liggesusers
/* class CGiniDisjointSets
 *
 * Copyleft (C) 2018-2021, Marek Gagolewski
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License
 * Version 3, 19 November 2007, published by the Free Software Foundation.
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License Version 3 for more details.
 * You should have received a copy of the License along with this program.
 * If this is not the case, refer to <https://www.gnu.org/licenses/>.
 */

#ifndef __c_gini_disjoint_sets_h
#define __c_gini_disjoint_sets_h

#include "c_common.h"
#include "c_disjoint_sets.h"
#include "c_int_dict.h"

/*! "Augmented" Disjoint Sets (Union-Find) Data Structure
 *
 *  A class to represent partitions of the set {0,1,...,n-1} for any n.
 *
 *  Stores the size of every set in the partition.
 */
class CCountDisjointSets : public CDisjointSets
{
protected:
    std::vector<ssize_t> cnt;  //!< cnt[find(x)] is the size of the relevant subset

public:
    /*! Starts with a "weak" partition {  {0}, {1}, ..., {n-1}  },
     *  i.e., n singletons.
     *
     *  @param n number of elements, n>=0.
     */
    CCountDisjointSets(ssize_t n) :
        CDisjointSets(n),
        cnt(n, 1)  // each cluster is of size 1
    { ; }

    /*! A nullary constructor allows Cython to allocate
     *  the instances on the stack. Do not use otherwise.
     */
    CCountDisjointSets() : CCountDisjointSets(0) { }

    /*! Returns the size of the subset containing x.
     *
     *  Run time: the cost of find(x)
     */
    ssize_t get_count(ssize_t x) {
        x = this->find(x);
        return this->cnt[x];
    }

    /*! Merges the sets containing x and y in {0,...,n-1}.
     *
     *  Let px be the parent id of x, and py be the parent id of y.
     *  If px < py, then the new parent id of py will be set to px.
     *  Otherwise, px will have py as its parent.
* * If x and y are members of the same subset, * an exception is thrown. * * @return the id of the parent of x or y, whichever is smaller. * * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} * * Update time: amortised O(1). */ virtual ssize_t merge(ssize_t x, ssize_t y) { // well, union is a reserved C++ keyword :) x = this->find(x); // includes a range check for x y = this->find(y); // includes a range check for y if (x == y) throw std::invalid_argument("find(x) == find(y)"); if (y < x) std::swap(x, y); // DisjointSet's merge part: this->par[y] = x; // update the parent of y this->k -= 1; // decrease the subset count // update the counts this->cnt[x] += this->cnt[y]; // cluster x has more elements now this->cnt[y] = 0; // cluster y, well, cleaning up return x; } }; /*! "Augmented" Disjoint Sets (Union-Find) Data Structure * * A class to represent partitions of the set {0,1,...,n-1} for any n. * * The class allows to compute the normalised Gini index of the distribution * of subset sizes, i.e., * \[ * G(x_1,\dots,x_k) = \frac{ * \sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j| * }{ * (n-1) \sum_{i=1}^n x_i * }. * \] * * The merge() operation, which also recomputes the Gini index, * has O(sqrt n) time complexity. * * For a use case, see: Gagolewski M., Bartoszuk M., Cena A., * Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, * Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003 */ class CGiniDisjointSets : public CCountDisjointSets{ protected: CIntDict number_of_size; /*!< number_of_size[i] gives the number * of subsets of size i (there are at most sqrt(n) possible * non-zero elements) */ double gini; //!< the Gini index of the current subset sizes ssize_t forgotten; /*! 
Re-compute the normalized Gini index * * based on a formula given in [TODO:derive the formula nicely] */ void recompute_gini() { gini = 0.0; if (number_of_size.size() > 1) { // otherwise all clusters are of identical sizes GENIECLUST_ASSERT(k-forgotten-1 > 0) ssize_t v = number_of_size.get_key_min(); ssize_t i = 0; while (v != number_of_size.get_key_max()) { ssize_t w = v; // previous v v = number_of_size.get_key_next(v); // next v i += number_of_size[w]; // cumulative counts gini += ((double)v-w)*i*((double)k-forgotten-i); } gini /= (double)(n*(k-forgotten-1.0)); // this is the normalised Gini index if (gini > 1.0) gini = 1.0; // account for round-off errors else if (gini < 0.0) gini = 0.0; } } /*! called by merge(x, y) and merge_and_forget(x, y) */ ssize_t merge(ssize_t x, ssize_t y, bool forget) { x = this->find(x); // includes a range check for x y = this->find(y); // includes a range check for y if (x == y) throw std::invalid_argument("find(x) == find(y)"); if (y < x) std::swap(x, y); // DisjointSet's merge part: this->par[y] = x; // update the parent of y this->k -= 1; // decrease the subset count // CCountDisjointSets's merge part: ssize_t size1 = this->cnt[x]; ssize_t size2 = this->cnt[y]; this->cnt[x] += this->cnt[y]; // cluster x has more elements now this->cnt[y] = 0; // cluster y, well, cleaning up //GENIECLUST_ASSERT(number_of_size.at(size1)>0); number_of_size[size1] -= 1; // one cluster of size1 is no more //GENIECLUST_ASSERT(number_of_size.at(size2)>0); number_of_size[size2] -= 1; // one cluster of size2 is an ex-cluster // get rid of size1 and size2, if necessary if (size2 < size1) std::swap(size1, size2); if (number_of_size.at(size1) <= 0) number_of_size.erase(size1); // fast if (size1 != size2 && number_of_size.at(size2) <= 0) number_of_size.erase(size2); // fast if (!forget) { ssize_t size12 = size1+size2; if (number_of_size.count(size12) == 0) number_of_size[size12] = 1; // might be O(sqrt(n)) else number_of_size[size12] += 1; // long live 
cluster of size1+2 } recompute_gini(); // all done return x; } public: /*! Starts with a "weak" partition { {0}, {1}, ..., {n-1} }, * i.e., n singletons. * * @param n number of elements, n>=0. */ CGiniDisjointSets(ssize_t n) : CCountDisjointSets(n), number_of_size(n+1), forgotten(0) { if (n>0) number_of_size[1] = n; // there are n clusters of size 1 gini = 0.0; // a perfectly balanced cluster size distribution } /*! A nullary constructor allows Cython to allocate * the instances on the stack. Do not use otherwise. */ CGiniDisjointSets() : CGiniDisjointSets(0) { } /*! Returns the Gini index of the distribution of subsets' sizes. * * Run time: O(1), as the Gini index is updated during a call * to merge(). */ double get_gini() const { return this->gini; } /*! Returns the size of the smallest subset. * * Run time: O(1). */ ssize_t get_smallest_count() const { return number_of_size.get_key_min(); /*this->tab_head;*/ } /*! Returns the size of the largest subset. * * Run time: O(1). */ ssize_t get_largest_count() const { return number_of_size.get_key_max(); /*this->tab_tail;*/ } /*! Returns the number of sets of given size * * Run time: O(1). */ ssize_t get_k_of_size(ssize_t c) { return number_of_size[c]; } /*! Determine the Gini index that you would get if x and y * were merged. 
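The size and Gini bookkeeping performed by `CCountDisjointSets`/`CGiniDisjointSets` can be mimicked in a few lines of Python. This is an illustrative toy model, not the package's code: it recomputes the index from scratch (O(k log k) per query), whereas the C++ class maintains it incrementally in amortised O(sqrt(n)) per merge via the `number_of_size` dictionary.

```python
class GiniDisjointSets:
    """Union-find over {0,...,n-1} that tracks subset sizes and can report
    the normalised Gini index of the size distribution (toy sketch)."""

    def __init__(self, n):
        self.par = list(range(n))  # par[root] == root
        self.cnt = [1] * n         # cnt[root] is the subset size, 0 otherwise
        self.n = n
        self.k = n                 # current number of subsets

    def find(self, x):
        while self.par[x] != x:
            self.par[x] = self.par[self.par[x]]  # path halving
            x = self.par[x]
        return x

    def get_count(self, x):
        return self.cnt[self.find(x)]

    def merge(self, x, y):
        x, y = self.find(x), self.find(y)
        if x == y:
            raise ValueError("find(x) == find(y)")
        if y < x:
            x, y = y, x            # the smaller root becomes the new parent
        self.par[y] = x
        self.cnt[x] += self.cnt[y]
        self.cnt[y] = 0
        self.k -= 1
        return x

    def gini(self):
        # G(x_1,...,x_k) = sum_{i<j} |x_i - x_j| / ((k-1) * sum_i x_i),
        # computed here via the sorted-data identity used elsewhere in genieclust
        sizes = sorted(c for c in self.cnt if c > 0)
        m, s = len(sizes), sum(sizes)
        if m <= 1:
            return 0.0  # a single cluster: perfectly balanced by convention
        num = sum((2 * j - m - 1) * v for j, v in enumerate(sizes, start=1))
        return num / ((m - 1) * s)
```

For instance, on five singletons the index starts at 0 and grows as the size distribution becomes unbalanced.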
*/ double test_gini_after_merge(ssize_t x, ssize_t y, bool forget) { x = this->find(x); // includes a range check for x y = this->find(y); // includes a range check for y ssize_t size1 = this->cnt[x]; ssize_t size2 = this->cnt[y]; ssize_t size12 = size1+size2; if (!(size1 <= size2)) std::swap(size1, size2); double new_gini = gini*(n)*(k-forgotten-1.0); ssize_t v = number_of_size.get_key_min(); while (true) { ssize_t vc = number_of_size[v]; new_gini -= vc*std::fabs(v-size1); new_gini -= vc*std::fabs(v-size2); if (!forget) new_gini += vc*std::fabs(v-size12); if (v == number_of_size.get_key_max()) break; v = number_of_size.get_key_next(v); // next v } new_gini += std::fabs(size2-size1); if (!forget) { new_gini -= std::fabs(size2-size12); new_gini -= std::fabs(size1-size12); } if (forget) ++forgotten; new_gini /= (n)*(double)(k-1-forgotten-1.0); new_gini = std::min(1.0, std::max(0.0, new_gini)); // avoid numeric inaccuracies return new_gini; } /*! Merges the sets containing x and y in {0,...,n-1}. * * Let px be the parent id of x, and py be the parent id of y. * If px < py, then the new parent id of py will be set to py. * Otherwise, px will have py as its parent. * * If x and y are members of the same subset, * an exception is thrown. * * @return the id of the parent of x or y, whichever is smaller. * * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} * * Update time: worst-case amortised O(sqrt(n)). */ virtual ssize_t merge(ssize_t x, ssize_t y) { return merge(x, y, /*forget=*/false); } /*! Merges the sets containing x and y in {0,...,n-1} * * The new cluster will not be taken into account when * computing the Gini index and calling get_smallest_count(). * * * @return the id of the parent of x or y, whichever is smaller. * * @param x a value in {0,...,n-1} * @param y a value in {0,...,n-1} * * Update time: worst-case amortised O(sqrt(n)). */ ssize_t merge_and_forget(ssize_t x, ssize_t y) { ++forgotten; return merge(x, y, /*forget=*/true); } /*! 
Generates an array of subsets' sizes. * The resulting vector is ordered nondecreasingly. * * Run time: O(k), where k is the current number of subsets. * * This is only valid if merge_and_forget has not been used. * * @param res [out] c_contiguous array of length k */ void get_counts(ssize_t* res) { GENIECLUST_ASSERT(forgotten == 0) ssize_t i = 0; for (CIntDict::iterator it = number_of_size.begin(); it != number_of_size.end(); ++it) { // add this->tab[v] times v for (ssize_t j=0; j * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_postprocess_h #define __c_postprocess_h #include "c_common.h" #include /*! Merge all "boundary" noise points with their nearest "core" points * * * For all the boundary points i, set c[i] = c[j], * where {i,j} is an edge in a spanning forest given by adjacency matrix ind. * * The i-th point is a boundary point if it is a noise point, i.e., c[i] < 0, * and it's amongst j's M-1 nearest neighbours. * * * @param ind c_contiguous matrix of size num_edges*2, * where {ind[i,0], ind[i,1]} specifies the i-th (undirected) edge * in a spanning tree or forest; ind[i,j] < n. * Edges with ind[i,0] < 0 or ind[i,1] < 0 are purposely ignored. 
* @param num_edges number of rows in ind (edges) * @param nn c_contiguous matrix of size n*num_neighbours; * nn[i,:] gives the indices of the i-th point's * nearest neighbours; -1 indicates a "missing value" * @param num_neighbours number of columns in nn * @param M smoothing factor, 2 <= M < num_neighbours * @param c [in/out] c_contiguous vector of length n, where * c[i] denotes the cluster id * (in {-1, 0, 1, ..., k-1} for some k) of the i-th object, i=0,...,n-1. * Class -1 denotes the `noise' cluster. * @param n length of c and the number of vertices in the spanning forest */ void Cmerge_boundary_points( const ssize_t* ind, ssize_t num_edges, const ssize_t* nn, ssize_t num_neighbours, ssize_t M, ssize_t* c, ssize_t n) { if (M < 2 || M-2 >= num_neighbours) throw std::domain_error("Incorrect smoothing factor M"); for (ssize_t i=0; i=n || v>=n) throw std::domain_error("All elements must be <= n"); if (c[u] < 0 && c[v] < 0) throw std::domain_error("Edge between two unallocated points detected"); if (c[u] >= 0 && c[v] >= 0) continue; if (c[v] < 0) std::swap(u, v); GENIECLUST_ASSERT(c[u] < 0); // u is marked as a noise point GENIECLUST_ASSERT(c[v] >= 0); // v is a core point // a noise point is not necessarily a boundary point: // u is a boundary point if u is amongst v's M-1 nearest neighbours //c[u] = -1; // it's negative anyway for (ssize_t j=0; j=n || v>=n) throw std::domain_error("All elements must be <= n"); if (c[u] < 0 && c[v] < 0) throw std::domain_error("An edge between two unallocated points detected"); if (c[u] < 0) c[u] = c[v]; else if (c[v] < 0) c[v] = c[u]; //else // continue; } } #endif genieclust/src/c_preprocess.h0000644000176200001440000000375314002205537016051 0ustar liggesusers/* Graph pre-processing and other routines * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free 
Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_preprocess_h #define __c_preprocess_h #include "c_common.h" #include #include #include #include #include "c_gini_disjoint_sets.h" #include "c_int_dict.h" /*! Compute the degree of each vertex in an undirected graph * over vertex set {0,...,n-1} * * @param ind c_contiguous matrix of size num_edges*2, * where {ind[i,0], ind[i,1]} is the i-th edge * with ind[i,j] < n. * Edges with ind[i,0] < 0 or ind[i,1] < 0 are purposely ignored. * @param num_edges number of edges (rows in ind) * @param n number of vertices * @param deg [out] array of size n, where * deg[i] will give the degree of the i-th vertex. */ void Cget_graph_node_degrees( const ssize_t* ind, ssize_t num_edges, ssize_t n, ssize_t* deg) { for (ssize_t i=0; i=n || v>=n) throw std::domain_error("All elements must be <= n"); if (u == v) throw std::domain_error("Self-loops are not allowed"); deg[u]++; deg[v]++; } } #endif genieclust/src/r_inequity.cpp0000644000176200001440000001037714040205757016113 0ustar liggesusers/* Economic Inequity (Inequality) Measures. * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. 
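The boundary-point rule implemented by `Cmerge_boundary_points` above can be sketched compactly in Python (an illustration only; the function and data layout below are simplified stand-ins for the C++ signature): a noise point connected by a forest edge to a core point inherits that point's cluster, but only if it is among the core point's M-1 nearest neighbours.

```python
def merge_boundary_points(edges, nn, M, c):
    """edges: list of (u, v) spanning-forest edges; nn[j]: indices of j's
    nearest neighbours, closest first; M: smoothing factor (M >= 2);
    c: cluster labels, -1 marks noise. Modifies and returns c."""
    for u, v in edges:
        if c[u] >= 0 and c[v] >= 0:
            continue                      # both endpoints already allocated
        if c[u] < 0 and c[v] < 0:
            raise ValueError("edge between two unallocated points")
        if c[v] < 0:
            u, v = v, u                   # make u the noise point, v the core point
        # u is a *boundary* point only if it is amongst v's M-1 nearest neighbours;
        # otherwise it stays marked as noise
        if u in nn[v][:M - 1]:
            c[u] = c[v]
    return c
```

In the first example below all three noise points lie on the boundary of their neighbouring clusters; in the second, the noise point is not among its neighbour's M-1 nearest neighbours and keeps the label -1.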
 * You should have received a copy of the License along with this program.
 * If this is not the case, refer to <https://www.gnu.org/licenses/>.
 */

#include "c_inequity.h"
#include <Rcpp.h>
#include <algorithm>


//' @title Inequity (Inequality) Measures
//'
//' @description
//' \code{gini_index()} gives the normalised Gini index
//' and \code{bonferroni_index()} implements the Bonferroni index.
//'
//' @details
//' Both indices can be used to quantify the "inequity" of a numeric sample.
//' They can be perceived as measures of data dispersion.
//' For constant vectors (perfect equity), the indices yield values of 0.
//' Vectors with all elements but one equal to 0 (perfect inequity)
//' are assigned scores of 1.
//' Both indices follow the Pigou-Dalton principle (are Schur-convex):
//' setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0}
//' and \eqn{x_i - h \geq x_j + h} (taking from the "rich" and
//' giving to the "poor") decreases the inequity.
//'
//' These indices have applications in economics, amongst others.
//' The Genie clustering algorithm uses the Gini index as a measure
//' of the inequality of cluster sizes.
//'
//'
//' The normalised Gini index is given by:
//' \deqn{
//'     G(x_1,\dots,x_n) = \frac{
//'     \sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j|
//'     }{
//'     (n-1) \sum_{i=1}^n x_i
//'     }.
//' }
//'
//' The normalised Bonferroni index is given by:
//' \deqn{
//'     B(x_1,\dots,x_n) = \frac{
//'     \sum_{i=1}^{n} \left(n-\sum_{j=1}^i \frac{n}{n-j+1}\right)
//'          x_{\sigma(n-i+1)}
//'     }{
//'     (n-1) \sum_{i=1}^n x_i
//'     }.
//' }
//'
//'
//' Time complexity: \eqn{O(n)} for sorted (increasingly) data.
//' Otherwise, the vector will be sorted.
//'
//' In particular, for ordered inputs, it holds:
//' \deqn{
//'     G(x_1,\dots,x_n) = \frac{
//'     \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)}
//'     }{
//'     (n-1) \sum_{i=1}^n x_i
//'     },
//' }
//' where \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}.
//'
//'
//' @references
//' Bonferroni C., Elementi di Statistica Generale, Libreria Seber,
//' Firenze, 1930.
//' //' Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and //' outlier-resistant hierarchical clustering algorithm, //' Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003 //' //' Gini C., Variabilita e Mutabilita, Tipografia di Paolo Cuppini, Bologna, 1912. //' //' //' @param x numeric vector of non-negative values //' //' @return The value of the inequity index, a number in \eqn{[0, 1]}. //' //' @examples //' gini_index(c(2, 2, 2, 2, 2)) # no inequality //' gini_index(c(0, 0, 10, 0, 0)) # one has it all //' gini_index(c(7, 0, 3, 0, 0)) # give to the poor, take away from the rich //' gini_index(c(6, 0, 3, 1, 0)) # (a.k.a. Pigou-Dalton principle) //' bonferroni_index(c(2, 2, 2, 2, 2)) //' bonferroni_index(c(0, 0, 10, 0, 0)) //' bonferroni_index(c(7, 0, 3, 0, 0)) //' bonferroni_index(c(6, 0, 3, 1, 0)) //' //' @rdname inequity //' @export // [[Rcpp::export]] double gini_index(Rcpp::NumericVector x) { ssize_t n = x.size(); // check if sorted; if not, sort. for (ssize_t i=1; i x[i]) { x = Rcpp::clone(x); std::sort(x.begin(), x.end()); break; } } return Cgini_sorted(REAL(SEXP(x)), n); } //' @rdname inequity //' @export // [[Rcpp::export]] double bonferroni_index(Rcpp::NumericVector x) { ssize_t n = x.size(); // check if sorted; if not, sort. for (ssize_t i=1; i x[i]) { x = Rcpp::clone(x); std::sort(x.begin(), x.end()); break; } } return Cbonferroni_sorted(REAL(SEXP(x)), n); } genieclust/src/c_scipy_rectangular_lsap.h0000644000176200001440000001760413706431457020435 0ustar liggesusers/* This file is adapted from scipy/scipy/optimize/rectangular_lsap/rectangular_lsap.cpp (version last updated on 5 Mar 2020; c050fd9) See https://github.com/scipy/scipy/ and https://scipy.org/scipylib/. This code implements the shortest augmenting path algorithm for the rectangular assignment problem. 
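The O(n) sorted-data identity quoted in the roxygen block above is easy to check with a short Python sketch (an illustration, not the package's `Cgini_sorted` implementation): for an increasingly sorted sample, the double sum of absolute differences collapses to a single weighted sum.

```python
def gini_index(x):
    """Normalised Gini index of a numeric sample (toy reference).

    Uses the O(n) formula for sorted inputs:
    G = sum_i (2i - n - 1) * x_(i)  /  ((n - 1) * sum_i x_i),
    where x_(1) <= ... <= x_(n)."""
    x = sorted(x)            # the C++ version sorts only if needed
    n, s = len(x), sum(x)
    if n <= 1 or s == 0:
        return 0.0
    num = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1))
    return num / ((n - 1) * s)
```

The values reproduce the roxygen examples: a constant vector scores 0, "one has it all" scores 1, and transferring from the rich to the poor (the Pigou-Dalton principle) lowers the index.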
This implementation is based on the pseudocode described in pages 1685-1686 of: Crouse D.F., On implementing 2D rectangular assignment algorithms, *IEEE Transactions on Aerospace and Electronic Systems* **52**(4), 2016, pp. 1679-1696, doi:10.1109/TAES.2016.140952. Original author: Peter M. Larsen (https://github.com/pmla/). Thanks!!! Keywords: the Hungarian algorithm, Kuhn-Munkres algorithm, a modified Jonker-Volgenant algorithm. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
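The rectangular assignment problem solved by the adapted SciPy code can be stated in one line: pick a distinct column for each row of the nr-by-nc cost matrix (nr <= nc) so that the total cost is extremal. A brute-force Python reference (illustration only; factorial time, so usable only for tiny matrices, unlike the O(n^3)-style shortest-augmenting-path algorithm below) makes the specification concrete:

```python
from itertools import permutations

def lsap_bruteforce(C, minimise=True):
    """Exhaustive reference solver for the rectangular assignment problem.

    C is an nr-by-nc cost matrix (list of lists, nr <= nc); returns the tuple
    col4row such that (i, col4row[i]) marks the selected entries."""
    nr, nc = len(C), len(C[0])
    assert nr <= nc
    pick = min if minimise else max
    # try every injective row -> column mapping and keep the best one
    return pick(permutations(range(nc), nr),
                key=lambda cols: sum(C[i][j] for i, j in enumerate(cols)))
```

Such a reference is handy for spot-checking an efficient solver on small random instances.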
*/ #ifndef __c_scipy_rectangular_lsap_h #define __c_scipy_rectangular_lsap_h #include "c_common.h" #include #include #include ssize_t __augmenting_path( ssize_t nc, std::vector& cost, std::vector& u, std::vector& v, std::vector& path, std::vector& row4col, std::vector& shortestPathCosts, ssize_t i, std::vector& SR, std::vector& SC, double* p_minVal); /** * Solves the 2D rectangular assignment problem * using the algorithm described in * * The procedure is adapted from * scipy/scipy/optimize/rectangular_lsap/rectangular_lsap.cpp * (version last updated on 5 Mar; c050fd9) * See https://github.com/scipy/scipy/ and https://scipy.org/scipylib/. * Author: P.M. Larsen. * * * * References * ========== * * Crouse D.F., On implementing 2D rectangular assignment algorithms, * *IEEE Transactions on Aerospace and Electronic Systems* **52**(4), 2016, * pp. 1679-1696, doi:10.1109/TAES.2016.140952. * * * @param C c_contiguous cost matrix; shape nr*nc * @param nr number of rows in C * @param nc number of cols in C, nc>=nr * @param output_col4row [output] c_contiguous vector of length nr; * (i, output_col4row[i]) gives location of the * nr items in C with the smallest sum. 
* @param minimise false if we seek the maximum * * @return 0 on success */ template ssize_t linear_sum_assignment( T* C, ssize_t nr, ssize_t nc, ssize_t* output_col4row, bool minimise=true) { if (nr > nc) throw std::domain_error("nr > nc"); // build a non-negative cost matrix std::vector cost(nr * nc); if (minimise) { double minval = *std::min_element(C, C + nr * nc); for (ssize_t i = 0; i < nr * nc; i++) { cost[i] = C[i] - minval; } } else { double maxval = *std::max_element(C, C + nr * nc); for (ssize_t i = 0; i < nr * nc; i++) { cost[i] = maxval-C[i]; } } // initialize variables std::vector u(nr, 0); std::vector v(nc, 0); std::vector shortestPathCosts(nc); std::vector path(nc, -1); std::vector col4row(nr, -1); std::vector row4col(nc, -1); std::vector SR(nr); std::vector SC(nc); // iteratively build the solution for (ssize_t curRow = 0; curRow < nr; curRow++) { double minVal; ssize_t sink = __augmenting_path(nc, cost, u, v, path, row4col, shortestPathCosts, curRow, SR, SC, &minVal); if (sink < 0) { return -1; } // update dual variables u[curRow] += minVal; for (ssize_t i = 0; i < nr; i++) { if (SR[i] && i != curRow) { u[i] += minVal - shortestPathCosts[col4row[i]]; } } for (ssize_t j = 0; j < nc; j++) { if (SC[j]) { v[j] -= minVal - shortestPathCosts[j]; } } // augment previous solution ssize_t j = sink; while (1) { ssize_t i = path[j]; row4col[j] = i; std::swap(col4row[i], j); if (i == curRow) { break; } } } for (ssize_t i = 0; i < nr; i++) { output_col4row[i] = col4row[i]; } return 0; } ssize_t __augmenting_path( ssize_t nc, std::vector& cost, std::vector& u, std::vector& v, std::vector& path, std::vector& row4col, std::vector& shortestPathCosts, ssize_t i, std::vector& SR, std::vector& SC, double* p_minVal) { double minVal = 0; // Crouse's pseudocode uses set complements to keep track of remaining // nodes. Here we use a vector, as it is more efficient in C++. 
ssize_t num_remaining = nc; std::vector remaining(nc); for (ssize_t it = 0; it < nc; it++) { // Filling this up in reverse order ensures that the solution of a // constant cost matrix is the identity matrix (c.f. #11602). remaining[it] = nc - it - 1; } std::fill(SR.begin(), SR.end(), false); std::fill(SC.begin(), SC.end(), false); std::fill(shortestPathCosts.begin(), shortestPathCosts.end(), INFINITY); // find shortest augmenting path ssize_t sink = -1; while (sink == -1) { ssize_t index = -1; double lowest = INFINITY; SR[i] = true; for (ssize_t it = 0; it < num_remaining; it++) { ssize_t j = remaining[it]; double r = minVal + cost[i * nc + j] - u[i] - v[j]; if (r < shortestPathCosts[j]) { path[j] = i; shortestPathCosts[j] = r; } // When multiple nodes have the minimum cost, we select one which // gives us a new sink node. This is particularly important for // cost matrices with small coefficients. if (shortestPathCosts[j] < lowest || (shortestPathCosts[j] == lowest && row4col[j] == -1)) { lowest = shortestPathCosts[j]; index = it; } } minVal = lowest; ssize_t j = remaining[index]; if (minVal == INFINITY) { // infeasible cost matrix return -1; } if (row4col[j] == -1) { sink = j; } else { i = row4col[j]; } SC[j] = true; remaining[index] = remaining[--num_remaining]; remaining.resize(num_remaining); } *p_minVal = minVal; return sink; } #endif genieclust/src/r_emst.cpp0000644000176200001440000000353414040172545015207 0ustar liggesusers/* Calls RcppMLPACK::DualTreeBoruvka::ComputeMST * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ // version 1.0.0 now uses mlpack::emst // #include // // #include // #include // #include "c_common.h" // // // // // Euclidean MST via MLPACK // // Calls RcppMLPACK::DualTreeBoruvka::ComputeMST // // [[Rcpp::export(".emst_mlpack")]] // Rcpp::NumericMatrix dot_emst_mlpack(Rcpp::NumericMatrix X) // { // ssize_t n = X.nrow(); // ssize_t d = X.ncol(); // // // Let aX = transpose(X) // arma::Mat aX(d, n); // for (ssize_t i=0; i aret; // mlpack::emst::DualTreeBoruvka<>(aX).ComputeMST(aret); // // Rcpp::NumericMatrix ret(n-1, 3); // for (ssize_t i=0; i 1-based) // ret(i, 1) = aret(1, i)+1; // greater edge index // ret(i, 2) = aret(2, i); // distance between the pair of points // GENIECLUST_ASSERT(ret(i, 0) < ret(i, 1)); // GENIECLUST_ASSERT(i == 0 || ret(i-1, 2) <= ret(i, 2)); // } // // return ret; // } genieclust/src/r_gclust.cpp0000644000176200001440000002532013775715173015553 0ustar liggesusers/* The Genie++ Clustering Algorithm - R Wrapper * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . 
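Before the R wrapper below hands the minimum spanning tree over to `CGenie::apply_genie`, it may help to see the Genie merge rule in miniature. The following Python sketch is a conceptual illustration only (function and variable names are ours, and the Gini index is recomputed naively rather than incrementally): consume MST edges in order of increasing weight, but whenever the Gini index of the cluster sizes exceeds the threshold, force the cheapest merge that involves one of the currently smallest clusters.

```python
def genie_labels(n, mst_edges, n_clusters, gini_threshold):
    """mst_edges: list of (weight, u, v) tuples sorted by weight; returns a
    label (root id) for each of the n points after reaching n_clusters."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def gini():
        s = sorted(size[i] for i in range(n) if find(i) == i)
        m = len(s)
        return 0.0 if m <= 1 else sum((2 * j - m - 1) * v
                                      for j, v in enumerate(s, 1)) / ((m - 1) * n)

    edges, k = list(mst_edges), n
    while k > n_clusters:
        if gini() > gini_threshold:
            # forced step: cheapest unused edge touching a smallest cluster
            smallest = min(size[i] for i in range(n) if find(i) == i)
            idx = next(i for i, (_, u, v) in enumerate(edges)
                       if min(size[find(u)], size[find(v)]) == smallest)
        else:
            # plain single-linkage step: cheapest unused edge
            idx = 0
        _, u, v = edges.pop(idx)
        u, v = find(u), find(v)
        size[u] += size[v]
        size[v] = 0
        parent[v] = u
        k -= 1
    return [find(i) for i in range(n)]
```

With a permissive threshold of 1.0 the procedure degenerates to single linkage and leaves the far-away point as a singleton; a stricter threshold forces the outlier to be absorbed, yielding more balanced clusters.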
*/ #include "c_common.h" #include "c_matrix.h" #include "c_distance.h" #include "c_mst.h" #include "c_genie.h" #include "c_postprocess.h" #include using namespace Rcpp; /* This function was originally part of our `genie` package for R */ void internal_generate_merge(ssize_t n, NumericMatrix links, NumericMatrix merge) { std::vector elements(n+1, 0); std::vector parents(n+1, 0); ssize_t clusterNumber = 1; for (ssize_t k=0; k merge(k, 1)) std::swap(merge(k, 0), merge(k, 1)); } } } /* Originally, this function was part of our `genie` package for R */ void internal_generate_order(ssize_t n, NumericMatrix merge, NumericVector order) { std::vector< std::list > relord(n+1); ssize_t clusterNumber = 1; for (ssize_t k=0; k::iterator it = relord[n-1].begin(); it != relord[n-1].end(); ++it) { order[k++] = (*it); } } template NumericMatrix internal_compute_mst(CDistance* D, ssize_t n, ssize_t M, bool verbose) { if (M < 1 || M >= n-1) stop("`M` must be an integer in [1, n-1)"); NumericMatrix ret(n-1, 3); CDistance* D2 = NULL; if (M >= 2) { // yep, we need it for M==2 as well if (verbose) GENIECLUST_PRINT("[genieclust] Determining the core distance.\n"); ssize_t k = M-1; matrix nn_i(n, k); matrix nn_d(n, k); Cknn_from_complete(D, n, k, nn_d.data(), nn_i.data()); NumericMatrix nn_r(n, k); std::vector d_core(n); for (ssize_t i=0; i(d_core.data(), n, D); } matrix mst_i(n-1, 2); std::vector mst_d(n-1); if (verbose) GENIECLUST_PRINT("[genieclust] Computing the MST.\n"); Cmst_from_complete(D2?D2:D, n, mst_d.data(), mst_i.data(), verbose); if (verbose) GENIECLUST_PRINT("[genieclust] Done.\n"); if (D2) delete D2; for (ssize_t i=0; i NumericMatrix internal_mst_default( NumericMatrix X, String distance, ssize_t M, /*bool use_mlpack, */ bool verbose) { ssize_t n = X.nrow(); ssize_t d = X.ncol(); NumericMatrix ret; matrix X2(REAL(SEXP(X)), n, d, false); // Fortran- to C-contiguous CDistance* D = NULL; if (distance == "euclidean" || distance == "l2") D = (CDistance*)(new 
CDistanceEuclideanSquared(X2.data(), n, d)); else if (distance == "manhattan" || distance == "cityblock" || distance == "l1") D = (CDistance*)(new CDistanceManhattan(X2.data(), n, d)); else if (distance == "cosine") D = (CDistance*)(new CDistanceCosine(X2.data(), n, d)); else stop("given `distance` is not supported (yet)"); ret = internal_compute_mst(D, n, M, verbose); delete D; if (distance == "euclidean" || distance == "l2") { for (ssize_t i=0; i(X, distance, M, verbose); else return internal_mst_default(X, distance, M, verbose); } // [[Rcpp::export(".mst.dist")]] NumericMatrix dot_mst_dist( NumericVector d, int M=1, bool verbose=false) { ssize_t n = (ssize_t)round((sqrt(1.0+8.0*d.size())+1.0)/2.0); GENIECLUST_ASSERT(n*(n-1)/2 == d.size()); CDistancePrecomputedVector D(REAL(SEXP(d)), n); return internal_compute_mst(&D, n, M, verbose); } // [[Rcpp::export(".genie")]] IntegerVector dot_genie( NumericMatrix mst, int k, double gini_threshold, String postprocess, bool detect_noise, bool verbose) { if (verbose) GENIECLUST_PRINT("[genieclust] Determining clusters.\n"); if (gini_threshold < 0.0 || gini_threshold > 1.0) stop("`gini_threshold` must be in [0, 1]"); if (postprocess == "boundary" && detect_noise && Rf_isNull(mst.attr("nn"))) stop("`nn` attribute of the MST not set; unable to proceed with this postprocessing action"); ssize_t n = mst.nrow()+1; if (k < 1 || k > n) stop("invalid requested number of clusters, `k`"); matrix mst_i(n-1, 2); std::vector mst_d(n-1); for (ssize_t i=0; i g(mst_d.data(), mst_i.data(), n, detect_noise); g.apply_genie(k, gini_threshold); if (verbose) GENIECLUST_PRINT("[genieclust] Postprocessing the outputs.\n"); std::vector xres(n); ssize_t k_detected = g.get_labels(k, xres.data()); if (k_detected != k) Rf_warning("Number of clusters detected is different than the requested one due to the presence of noise points."); if (detect_noise && postprocess == "boundary") { NumericMatrix nn_r = mst.attr("nn"); GENIECLUST_ASSERT(nn_r.nrow() == n); 
ssize_t M = nn_r.ncol()+1; GENIECLUST_ASSERT(M < n); matrix nn_i(n, M-1); for (ssize_t i=0; i= 1); GENIECLUST_ASSERT(nn_r(i,j) <= n); nn_i(i,j) = (ssize_t)nn_r(i,j)-1; // 0-based indexing } } Cmerge_boundary_points(mst_i.data(), n-1, nn_i.data(), M-1, M, xres.data(), n); } else if (detect_noise && postprocess == "all") { Cmerge_noise_points(mst_i.data(), n-1, xres.data(), n); } IntegerVector res(n); for (ssize_t i=0; i 1.0) stop("`gini_threshold` must be in [0, 1]"); ssize_t n = mst.nrow()+1; matrix mst_i(n-1, 2); std::vector mst_d(n-1); for (ssize_t i=0; i g(mst_d.data(), mst_i.data(), n/*, noise_leaves=M>1*/); g.apply_genie(1, gini_threshold); if (verbose) GENIECLUST_PRINT("[genieclust] Postprocessing the outputs.\n"); std::vector links(n-1); g.get_links(links.data()); NumericMatrix links2(n-1, 2); NumericVector height(n-1, NA_REAL); ssize_t k = 0; for (ssize_t i=0; i= 0) { links2(k, 0) = mst_i(links[i], 0) + 1; links2(k, 1) = mst_i(links[i], 1) + 1; height(k) = mst_d[ links[i] ]; ++k; } } for (; k do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include using namespace Rcpp; // adjusted_rand_score double adjusted_rand_score(RObject x, RObject y); RcppExport SEXP _genieclust_adjusted_rand_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(adjusted_rand_score(x, y)); return rcpp_result_gen; END_RCPP } // rand_score double rand_score(RObject x, RObject y); RcppExport SEXP _genieclust_rand_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(rand_score(x, y)); return rcpp_result_gen; END_RCPP } // adjusted_fm_score double adjusted_fm_score(RObject 
x, RObject y); RcppExport SEXP _genieclust_adjusted_fm_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(adjusted_fm_score(x, y)); return rcpp_result_gen; END_RCPP } // fm_score double fm_score(RObject x, RObject y); RcppExport SEXP _genieclust_fm_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(fm_score(x, y)); return rcpp_result_gen; END_RCPP } // mi_score double mi_score(RObject x, RObject y); RcppExport SEXP _genieclust_mi_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(mi_score(x, y)); return rcpp_result_gen; END_RCPP } // normalized_mi_score double normalized_mi_score(RObject x, RObject y); RcppExport SEXP _genieclust_normalized_mi_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(normalized_mi_score(x, y)); return rcpp_result_gen; END_RCPP } // adjusted_mi_score double adjusted_mi_score(RObject x, RObject y); RcppExport SEXP _genieclust_adjusted_mi_score(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(adjusted_mi_score(x, y)); return rcpp_result_gen; END_RCPP } // 
normalized_accuracy double normalized_accuracy(RObject x, RObject y); RcppExport SEXP _genieclust_normalized_accuracy(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(normalized_accuracy(x, y)); return rcpp_result_gen; END_RCPP } // pair_sets_index double pair_sets_index(RObject x, RObject y); RcppExport SEXP _genieclust_pair_sets_index(SEXP xSEXP, SEXP ySEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type x(xSEXP); Rcpp::traits::input_parameter< RObject >::type y(ySEXP); rcpp_result_gen = Rcpp::wrap(pair_sets_index(x, y)); return rcpp_result_gen; END_RCPP } // dot_mst_default NumericMatrix dot_mst_default(NumericMatrix X, String distance, int M, bool cast_float32, bool verbose); RcppExport SEXP _genieclust_dot_mst_default(SEXP XSEXP, SEXP distanceSEXP, SEXP MSEXP, SEXP cast_float32SEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); Rcpp::traits::input_parameter< String >::type distance(distanceSEXP); Rcpp::traits::input_parameter< int >::type M(MSEXP); Rcpp::traits::input_parameter< bool >::type cast_float32(cast_float32SEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_mst_default(X, distance, M, cast_float32, verbose)); return rcpp_result_gen; END_RCPP } // dot_mst_dist NumericMatrix dot_mst_dist(NumericVector d, int M, bool verbose); RcppExport SEXP _genieclust_dot_mst_dist(SEXP dSEXP, SEXP MSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericVector >::type d(dSEXP); Rcpp::traits::input_parameter< int >::type M(MSEXP); 
Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_mst_dist(d, M, verbose)); return rcpp_result_gen; END_RCPP } // dot_genie IntegerVector dot_genie(NumericMatrix mst, int k, double gini_threshold, String postprocess, bool detect_noise, bool verbose); RcppExport SEXP _genieclust_dot_genie(SEXP mstSEXP, SEXP kSEXP, SEXP gini_thresholdSEXP, SEXP postprocessSEXP, SEXP detect_noiseSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type mst(mstSEXP); Rcpp::traits::input_parameter< int >::type k(kSEXP); Rcpp::traits::input_parameter< double >::type gini_threshold(gini_thresholdSEXP); Rcpp::traits::input_parameter< String >::type postprocess(postprocessSEXP); Rcpp::traits::input_parameter< bool >::type detect_noise(detect_noiseSEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_genie(mst, k, gini_threshold, postprocess, detect_noise, verbose)); return rcpp_result_gen; END_RCPP } // dot_gclust List dot_gclust(NumericMatrix mst, double gini_threshold, bool verbose); RcppExport SEXP _genieclust_dot_gclust(SEXP mstSEXP, SEXP gini_thresholdSEXP, SEXP verboseSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type mst(mstSEXP); Rcpp::traits::input_parameter< double >::type gini_threshold(gini_thresholdSEXP); Rcpp::traits::input_parameter< bool >::type verbose(verboseSEXP); rcpp_result_gen = Rcpp::wrap(dot_gclust(mst, gini_threshold, verbose)); return rcpp_result_gen; END_RCPP } // gini_index double gini_index(Rcpp::NumericVector x); RcppExport SEXP _genieclust_gini_index(SEXP xSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::NumericVector >::type x(xSEXP); rcpp_result_gen = Rcpp::wrap(gini_index(x)); return rcpp_result_gen; 
END_RCPP } // bonferroni_index double bonferroni_index(Rcpp::NumericVector x); RcppExport SEXP _genieclust_bonferroni_index(SEXP xSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< Rcpp::NumericVector >::type x(xSEXP); rcpp_result_gen = Rcpp::wrap(bonferroni_index(x)); return rcpp_result_gen; END_RCPP } static const R_CallMethodDef CallEntries[] = { {"_genieclust_adjusted_rand_score", (DL_FUNC) &_genieclust_adjusted_rand_score, 2}, {"_genieclust_rand_score", (DL_FUNC) &_genieclust_rand_score, 2}, {"_genieclust_adjusted_fm_score", (DL_FUNC) &_genieclust_adjusted_fm_score, 2}, {"_genieclust_fm_score", (DL_FUNC) &_genieclust_fm_score, 2}, {"_genieclust_mi_score", (DL_FUNC) &_genieclust_mi_score, 2}, {"_genieclust_normalized_mi_score", (DL_FUNC) &_genieclust_normalized_mi_score, 2}, {"_genieclust_adjusted_mi_score", (DL_FUNC) &_genieclust_adjusted_mi_score, 2}, {"_genieclust_normalized_accuracy", (DL_FUNC) &_genieclust_normalized_accuracy, 2}, {"_genieclust_pair_sets_index", (DL_FUNC) &_genieclust_pair_sets_index, 2}, {"_genieclust_dot_mst_default", (DL_FUNC) &_genieclust_dot_mst_default, 5}, {"_genieclust_dot_mst_dist", (DL_FUNC) &_genieclust_dot_mst_dist, 3}, {"_genieclust_dot_genie", (DL_FUNC) &_genieclust_dot_genie, 6}, {"_genieclust_dot_gclust", (DL_FUNC) &_genieclust_dot_gclust, 3}, {"_genieclust_gini_index", (DL_FUNC) &_genieclust_gini_index, 1}, {"_genieclust_bonferroni_index", (DL_FUNC) &_genieclust_bonferroni_index, 1}, {NULL, NULL, 0} }; RcppExport void R_init_genieclust(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } genieclust/src/c_compare_partitions.h0000644000176200001440000003030414002205473017555 0ustar liggesusers/* External Cluster Validity Measures * * Adjusted- and Nonadjusted Rand Score, * Adjusted- and Nonadjusted Fowlkes-Mallows Score, * Adjusted-, Normalised and Nonadjusted Mutual Information Score, * Normalised Accuracy, 
Pair Sets Index * (for vectors of "small" ints) * * * References * ========== * * Hubert L., Arabie P., Comparing Partitions, * Journal of Classification 2(1), 1985, pp. 193-218, esp. Eqs. (2) and (4) * * Vinh N.X., Epps J., Bailey J., * Information theoretic measures for clusterings comparison: * Variants, properties, normalization and correction for chance, * Journal of Machine Learning Research 11, 2010, pp. 2837-2854. * * Rezaei M., Franti P., Set matching measures for external cluster validity, * IEEE Transactions on Knowledge and Data Mining 28(8), 2016, pp. 2173-2186, * doi:10.1109/TKDE.2016.2551240 * * * Copyleft (C) 2018-2021, Marek Gagolewski * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License * Version 3, 19 November 2007, published by the Free Software Foundation. * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License Version 3 for more details. * You should have received a copy of the License along with this program. * If this is not the case, refer to . */ #ifndef __c_compare_partitions_h #define __c_compare_partitions_h #include "c_common.h" #include "c_scipy_rectangular_lsap.h" #include #include #include /*! (t choose 2) * * @param t * @return t*(t-1.0)*0.5 */ inline double Ccomb2(double t) { return t*(t-1.0)*0.5; } /*! Computes both the minimum and the maximum in an array * * @param x c_contiguous array of length n * @param n length of x * @param xmin [out] the minimum of x * @param xmax [out] the maximum of x */ template void Cminmax(const T* x, ssize_t n, T* xmin, T* xmax) { *xmin = x[0]; *xmax = x[0]; for (ssize_t i=1; i *xmax) *xmax = x[i]; } } /*! * Stores AR and FM scores as well as their adjusted/normalised versions. 
*/ struct CComparePartitionsPairsResult { double ar; double r; double fm; double afm; }; /*! * Stores mutual information-based scores */ struct CComparePartitionsInfoResult { double mi; double nmi; double ami; }; /*! Applies partial pivoting to a given confusion matrix - permutes the columns * so as to have the largest elements in each row on the main diagonal. * * This comes in handy whenever C actually summarises the results generated * by clustering algorithms, where actual label values do not matter * (e.g., (1, 2, 0) can be remapped to (0, 2, 1) with no change in meaning. * * * @param C [in/out] a c_contiguous confusion matrix of size xc*yc * @param xc number of rows in C * @param yc number of columns in C * * Note that C is modified in-place (overwritten). */ template void Capply_pivoting(T* C, ssize_t xc, ssize_t yc) { for (ssize_t i=0; i void Ccontingency_table(T* C, ssize_t xc, ssize_t yc, T xmin, T ymin, const T* x, const T* y, ssize_t n) { for (ssize_t j=0; j (x[i]-xmin)*yc +(y[i]-ymin)); C[(x[i]-xmin)*yc +(y[i]-ymin)]++; } } /*! Computes the adjusted and nonadjusted Rand- and FM scores * based on a given confusion matrix. * * References * ========== * * Hubert L., Arabie P., Comparing Partitions, * Journal of Classification 2(1), 1985, pp. 193-218, esp. Eqs. 
(2) and (4) * * @param C a c_contiguous confusion matrix of size xc*yc * @param xc number of rows in C * @param yc number of columns in C * * @return the computed scores */ template CComparePartitionsPairsResult Ccompare_partitions_pairs(const T* C, ssize_t xc, ssize_t yc) { double n = 0.0; // total sum (length of the underlying x and y = number of points) for (ssize_t ij=0; ij CComparePartitionsInfoResult Ccompare_partitions_info(const T* C, ssize_t xc, ssize_t yc) { double n = 0.0; // total sum (length of the underlying x and y = number of points) for (ssize_t ij=0; ij sum_x(xc); std::vector sum_y(yc); double h_x = 0.0, h_y = 0.0, h_x_cond_y = 0.0, h_x_y = 0.0; for (ssize_t i=0; i 0) h_x_y += C[i*yc+j]*std::log((double)C[i*yc+j]/(double)n); t += C[i*yc+j]; } sum_x[i] = t; if (t > 0) h_y += t*std::log((double)t/(double)n); } for (ssize_t j=0; j 0) h_x_cond_y += C[i*yc+j]*std::log((double)C[i*yc+j]/sum_x[i]); t += C[i*yc+j]; } sum_y[j] = t; if (t > 0) h_x += t*std::log((double)t/(double)n); } h_x = -h_x/(double)n; h_y = -h_y/(double)n; h_x_cond_y = -h_x_cond_y/(double)n; h_x_y = -h_x_y/(double)n; double e_mi = 0.0; for (ssize_t i=0; i double Ccompare_partitions_nacc(const T* C, ssize_t xc, ssize_t yc) { GENIECLUST_ASSERT(xc <= yc); double n = 0.0; // total sum (length of the underlying x and y = number of points) for (ssize_t ij=0; ij output_col4row(xc); ssize_t retval = linear_sum_assignment(C, xc, yc, output_col4row.data(), false); // minimise=false GENIECLUST_ASSERT(retval == 0); double t = 0.0; for (ssize_t i=0; i double Ccompare_partitions_psi(const T* C, ssize_t xc, ssize_t yc) { GENIECLUST_ASSERT(xc <= yc); double n = 0.0; // total sum (length of the underlying x and y = number of points) for (ssize_t ij=0; ij sum_x(xc); std::vector sum_y(yc); for (ssize_t i=0; i S(xc*yc); for (ssize_t i=0; i output_col4row2(xc); ssize_t retval = linear_sum_assignment(S.data(), xc, yc, output_col4row2.data(), false); // minimise=false GENIECLUST_ASSERT(retval == 0); double s 
= 0.0; for (ssize_t i=0; i= 3.7 (implied by `numpy`). - [Python] Require `nmslib`. - [R] Use `RcppMLPACK` directly; remove dependency on `emstreeR`. - [R] Use `tinytest` for unit testing instead of `testthat`. ## genieclust 0.9.4 (2020-07-31) - [Bugfix] [R] Fix build errors on Solaris. ## genieclust 0.9.3 (2020-07-25) - [Bugfix] [Python] Add code coverage CI. Fix some minor inconsistencies. Automate the `bdist` build chain. - [R] Update DESCRIPTION to meet the CRAN policies. ## genieclust 0.9.2 (2020-07-22) - [BUGFIX] [Python] Fix broken build script for OS X with no OpenMP. ## genieclust 0.9.1 (2020-07-18) - [General] The package has been completely rewritten. The core functionality is now implemented in C++ (with OpenMP). - [General] Clustering with respect to HDBSCAN*-like mutual reachability distances is supported. - [General] The parallelised Jarnik-Prim algorithm now supports on-the-fly distance computations. Euclidean minimum spanning tree can be determined with `mlpack`, which is much faster in low-dimensional spaces. - [R] R version is now available. - [Python] [Experimental] The GIc algorithm proposed by Anna Cena in her 2018 PhD thesis is added. - [Python] Approximate version based on nearest neighbour graphs produced by `nmslib` is added. ## genieclust 0.1a2 (2018-05-23) - [Python] Initial PyPI release. genieclust/R/0000755000176200001440000000000014040206351012610 5ustar liggesusersgenieclust/R/gclust.R0000644000176200001440000002660514002205001014232 0ustar liggesusers# This file is part of the genieclust package for R. # ############################################################################ # # # # Copyleft (C) 2020-2021, Marek Gagolewski # # # # # # This program is free software: you can redistribute it and/or modify # # it under the terms of the GNU Affero General Public License # # Version 3, 19 November 2007, published by the Free Software Foundation. 
# # This program is distributed in the hope that it will be useful, # # but WITHOUT ANY WARRANTY; without even the implied warranty of # # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # # GNU Affero General Public License Version 3 for more details. # # You should have received a copy of the License along with this program. # # If this is not the case, refer to . # # # # ############################################################################ # .correct_height <- function(height) { # correction for the departure from ultrametricity # TODO: allow one choose? # cumsum(height) rev(cummin(rev(height))) } #' @title The Genie++ Hierarchical Clustering Algorithm #' #' @description #' A reimplementation of \emph{Genie} - a robust and outlier resistant #' clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016). #' The Genie algorithm is based on a minimum spanning tree (MST) of the #' pairwise distance graph of a given point set. #' Just like single linkage, it consumes the edges #' of the MST in increasing order of weights. However, it prevents #' the formation of clusters of highly imbalanced sizes; once the Gini index #' (see \code{\link{gini_index}()}) of the cluster size distribution #' raises above \code{gini_threshold}, a forced merge of a point group #' of the smallest size is performed. Its appealing simplicity goes hand #' in hand with its usability; Genie often outperforms #' other clustering approaches on benchmark data, #' such as \url{https://github.com/gagolews/clustering_benchmarks_v1}. #' #' The clustering can now also be computed with respect to the #' mutual reachability distance (based, e.g., on the Euclidean metric), #' which is used in the definition of the HDBSCAN* algorithm #' (see Campello et al., 2015). If \code{M} > 1, then the mutual reachability #' distance \eqn{m(i,j)} with smoothing factor \code{M} is used instead of the #' chosen "raw" distance \eqn{d(i,j)}. 
It holds \eqn{m(i,j)=\max(d(i,j), c(i), c(j))}, #' where \eqn{c(i)} is \eqn{d(i,k)} with \eqn{k} being the #' (\code{M}-1)-th nearest neighbour of \eqn{i}. #' This makes "noise" and "boundary" points "pulled away" from each other. #' #' The Genie correction together with the smoothing factor \code{M} > 1 (note that #' \code{M} = 2 corresponds to the original distance) gives a robustified version of #' the HDBSCAN* algorithm that is able to detect a predefined number of #' clusters. Hence it does not depend on the DBSCAN's somewhat magical #' \code{eps} parameter or the HDBSCAN's \code{min_cluster_size} one. #' #' #' @details #' Note that, as in the case of all distance-based methods, #' the standardisation of the input features is definitely worth giving a try. #' #' If \code{d} is a numeric matrix or an object of class \code{dist}, #' \code{\link{mst}()} will be called to compute an MST, which generally #' takes at most \eqn{O(n^2)} time (the algorithm we provide is parallelised; #' the environment variable \code{OMP_NUM_THREADS} controls the number of threads #' in use). However, see \code{\link{emst_mlpack}()} for a very fast alternative #' in the case of Euclidean spaces of (very) low dimensionality and \code{M} = 1. #' #' Given a minimum spanning tree, the algorithm runs in \eqn{O(n \sqrt{n})} time. #' Therefore, if you want to test different \code{gini_threshold}s #' (or \code{k}s), it is best to compute the MST explicitly first. #' #' According to the algorithm's original definition, #' the resulting partition tree (dendrogram) might violate #' the ultrametricity property (merges might occur at levels that #' are not increasing w.r.t. a between-cluster distance). #' Departures from ultrametricity are corrected by applying #' \code{height = rev(cummin(rev(height)))}.
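The ultrametricity correction above, `rev(cummin(rev(height)))`, is simply a suffix minimum over the merge heights. A standalone C++ sketch of the same operation (illustrative only; `correct_height` is a name assumed here, not a package symbol):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// C++ analogue of the R expression rev(cummin(rev(height))): replace each
// merge height with the minimum of itself and all subsequent heights, so
// that the resulting sequence is nondecreasing (ultrametric-like).
std::vector<double> correct_height(std::vector<double> h) {
    for (std::size_t i = h.size(); i-- > 1; )
        h[i - 1] = std::min(h[i - 1], h[i]);
    return h;
}
```

For instance, heights (1, 3, 2, 5) become (1, 2, 2, 5): the offending merge level 3 is lowered to 2.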
#' #' #' @param d a numeric matrix (or an object coercible to one, #' e.g., a data frame with numeric-like columns), an #' object of class \code{dist} (see \code{\link[stats]{dist}}), #' or an object of class \code{mst} (see \code{\link{mst}()}). #' @param gini_threshold threshold for the Genie correction, i.e., #' the Gini index of the cluster size distribution; #' a threshold of 1.0 disables the correction. #' Low thresholds highly penalise the formation of small clusters. #' @param distance metric used to compute the linkage, one of: #' \code{"euclidean"} (synonym: \code{"l2"}), #' \code{"manhattan"} (a.k.a. \code{"l1"} and \code{"cityblock"}), #' \code{"cosine"}. #' @param verbose logical; whether to print diagnostic messages #' and progress information. #' @param cast_float32 logical; whether to compute the distances using 32-bit #' instead of 64-bit precision floating-point arithmetic (up to 2x faster). #' @param ... further arguments passed to other methods. #' @param k the desired number of clusters to detect; \code{k} = 1 with \code{M} > 1 #' acts as a noise point detector. #' @param detect_noise whether the minimum spanning tree's leaves #' should be marked as noise points; defaults to \code{TRUE} if \code{M} > 1, #' for compatibility with HDBSCAN*. #' @param M smoothing factor; \code{M} <= 2 gives the selected \code{distance}; #' otherwise, the mutual reachability distance is used. #' @param postprocess one of \code{"boundary"} (default), \code{"none"} #' or \code{"all"}; in effect only if \code{M} > 1. #' By default, only "boundary" points are merged #' with their nearest "core" points (a point is a boundary point if it is #' a noise point that is amongst its adjacent vertex's #' \code{M}-1 nearest neighbours). To force a classical #' k-partition of a data set (with no notion of noise), #' choose \code{"all"}.
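The mutual reachability distance used when `M` > 1 can be illustrated with a short C++ sketch. This is an assumption-laden toy version operating on a dense distance matrix (function names are made up for the illustration; the package computes distances on the fly instead), and it requires `M` >= 2:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Core distance c(i): the distance from point i to its (M-1)-th nearest
// neighbour (excluding i itself), read off a dense n*n distance matrix.
// Assumes M >= 2 and at least M-1 other points.
double core_distance(const std::vector<std::vector<double>>& d, int i, int M) {
    std::vector<double> row;
    for (int j = 0; j < (int)d.size(); ++j)
        if (j != i) row.push_back(d[i][j]);
    std::nth_element(row.begin(), row.begin() + (M - 2), row.end());
    return row[M - 2];  // the (M-1)-th smallest distance, 0-based index M-2
}

// Mutual reachability distance m(i,j) = max(d(i,j), c(i), c(j)).
double mutual_reachability(const std::vector<std::vector<double>>& d,
                           int i, int j, int M) {
    return std::max({d[i][j], core_distance(d, i, M), core_distance(d, j, M)});
}
```

Larger `M` inflates the distances around points lying in sparse regions, which is what pulls noise points away from the clusters.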
#' #' #' @return #' \code{gclust()} computes the whole clustering hierarchy; it #' returns a list of class \code{hclust}, #' see \code{\link[stats]{hclust}}. Use \code{link{cutree}()} to obtain #' an arbitrary k-partition. #' #' \code{genie()} returns a \code{k}-partition - a vector with elements in 1,...,k, #' whose i-th element denotes the i-th input point's cluster identifier. #' Missing values (\code{NA}) denote noise points (if \code{detect_noise} #' is \code{TRUE}). #' #' @seealso #' \code{\link{mst}()} for the minimum spanning tree routines. #' #' \code{\link{adjusted_rand_score}()} (amongst others) for external #' cluster validity measures (partition similarity scores). #' #' #' @references #' Gagolewski M., Bartoszuk M., Cena A., #' Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, #' \emph{Information Sciences} 363, 2016, 8-23. #' #' Campello R., Moulavi D., Zimek A., Sander J., #' Hierarchical density estimates for data clustering, visualization, #' and outlier detection, #' ACM Transactions on Knowledge Discovery from Data 10(1), 2015, 5:1–5:51. #' #' #' @examples #' library("datasets") #' data("iris") #' X <- iris[1:4] #' h <- gclust(X) #' y_pred <- cutree(h, 3) #' y_test <- iris[,5] #' plot(iris[,2], iris[,3], col=y_pred, #' pch=as.integer(iris[,5]), asp=1, las=1) #' adjusted_rand_score(y_test, y_pred) #' pair_sets_index(y_test, y_pred) #' #' # Fast for low-dimensional Euclidean spaces: #' h <- gclust(emst_mlpack(X)) #' #' @rdname gclust #' @export gclust <- function(d, ...) { UseMethod("gclust") } #' @export #' @rdname gclust #' @method gclust default gclust.default <- function(d, gini_threshold=0.3, distance=c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), cast_float32=TRUE, verbose=FALSE, ...) 
{ stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0) distance <- match.arg(distance) d <- as.matrix(d) gclust.mst(mst.default(d, M=1L, distance=distance, verbose=verbose, cast_float32=cast_float32), gini_threshold=gini_threshold, verbose=verbose) } #' @export #' @rdname gclust #' @method gclust dist gclust.dist <- function(d, gini_threshold=0.3, verbose=FALSE, ...) { stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0) gclust.mst(mst.dist(d, M=1L, verbose=verbose), gini_threshold=gini_threshold, verbose=verbose) } #' @export #' @rdname gclust #' @method gclust mst gclust.mst <- function(d, gini_threshold=0.3, verbose=FALSE, ...) { stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0) result <- .gclust(d, gini_threshold, verbose) result[["height"]] <- .correct_height(result[["height"]]) result[["labels"]] <- attr(d, "Labels") # yes, >L1L, cast_float32=TRUE, verbose=FALSE, ...) { stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0) postprocess <- match.arg(postprocess) distance <- match.arg(distance) d <- as.matrix(d) genie.mst(mst.default(d, M=M, distance=distance, verbose=verbose, cast_float32=cast_float32), k=k, gini_threshold=gini_threshold, postprocess=postprocess, detect_noise=detect_noise, verbose=verbose) } #' @export #' @rdname gclust #' @method genie dist genie.dist <- function(d, k, gini_threshold=0.3, M=1L, postprocess=c("boundary", "none", "all"), detect_noise=M>1L, verbose=FALSE, ...) { stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0) postprocess <- match.arg(postprocess) genie.mst(mst.dist(d, M=M, verbose=verbose), k=k, gini_threshold=gini_threshold, postprocess=postprocess, detect_noise=detect_noise, verbose=verbose) } #' @export #' @rdname gclust #' @method genie mst genie.mst <- function(d, k, gini_threshold=0.3, postprocess=c("boundary", "none", "all"), detect_noise=FALSE, verbose=FALSE, ...) 
{ stopifnot(gini_threshold >= 0.0, gini_threshold <= 1.0) postprocess <- match.arg(postprocess) structure( .genie(d, k, gini_threshold, postprocess, detect_noise, verbose), names=attr(d, "Labels") ) } registerS3method("gclust", "default", "gclust.default") registerS3method("gclust", "dist", "gclust.dist") registerS3method("gclust", "mst", "gclust.mst") registerS3method("genie", "default", "genie.default") registerS3method("genie", "dist", "genie.dist") registerS3method("genie", "mst", "genie.mst") genieclust/R/mst.R0000644000176200001440000002063314040206030013534 0ustar liggesusers# This file is part of the genieclust package for R. # ############################################################################ # # # # Copyleft (C) 2020-2021, Marek Gagolewski # # # # # # This program is free software: you can redistribute it and/or modify # # it under the terms of the GNU Affero General Public License # # Version 3, 19 November 2007, published by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # # but WITHOUT ANY WARRANTY; without even the implied warranty of # # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # # GNU Affero General Public License Version 3 for more details. # # You should have received a copy of the License along with this program. # # If this is not the case, refer to . # # # # ############################################################################ # #' @title Minimum Spanning Tree of the Pairwise Distance Graph #' #' @description #' An parallelised implementation of a Jarník (Prim/Dijkstra)-like #' algorithm for determining #' a(*) minimum spanning tree (MST) of a complete undirected graph #' representing a set of n points #' with weights given by a pairwise distance matrix. #' #' (*) Note that there might be multiple minimum trees spanning a given graph. 
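To make the Jarník (Prim)-like procedure concrete, here is a minimal serial C++ sketch over a dense symmetric distance matrix. It is an illustration only: the package's version is parallelised, returns the edge list rather than just the weight, and computes distances on the fly instead of storing the full matrix.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Serial Jarnik (Prim)-style MST over a dense symmetric distance matrix.
// Returns the total weight of a minimum spanning tree; O(n^2) time and
// O(n) extra memory.
double mst_total_weight(const std::vector<std::vector<double>>& d) {
    const int n = (int)d.size();
    std::vector<double> best(n, std::numeric_limits<double>::infinity());
    std::vector<bool> in_tree(n, false);
    best[0] = 0.0;
    double total = 0.0;
    for (int it = 0; it < n; ++it) {
        int u = -1;
        for (int v = 0; v < n; ++v)   // pick the closest vertex outside the tree
            if (!in_tree[v] && (u < 0 || best[v] < best[u])) u = v;
        in_tree[u] = true;
        total += best[u];
        for (int v = 0; v < n; ++v)   // relax distances to the grown tree
            if (!in_tree[v] && d[u][v] < best[v]) best[v] = d[u][v];
    }
    return total;
}
```

Each of the n iterations scans all vertices once, hence the quadratic total time.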
#' #' @details #' If \code{d} is a numeric matrix of size \eqn{n \times p}, #' the \eqn{n (n-1)/2} distances are computed on the fly, so that \eqn{O(n M)} #' memory is used. #' #' #' The algorithm is parallelised; set the \code{OMP_NUM_THREADS} environment #' variable (see \code{\link[base]{Sys.setenv}}) to control the number of threads #' used. #' #' Time complexity is \eqn{O(n^2)} for the method accepting an object of #' class \code{dist} and \eqn{O(p n^2)} otherwise. #' #' If \code{M} >= 2, then the mutual reachability distance \eqn{m(i,j)} with smoothing #' factor \code{M} (see Campello et al. 2015) #' is used instead of the chosen "raw" distance \eqn{d(i,j)}. #' It holds \eqn{m(i, j)=\max(d(i,j), c(i), c(j))}, where \eqn{c(i)} is #' \eqn{d(i, k)} with \eqn{k} being the (\code{M}-1)-th nearest neighbour of \eqn{i}. #' This makes "noise" and "boundary" points "pulled away" from each other. #' The Genie++ clustering algorithm (see \code{\link{gclust}}) #' with respect to the mutual reachability distance gains the ability to #' identify some observations as noise points. #' #' Note that the case \code{M} = 2 corresponds to the original distance, but we are #' determining the 1-nearest neighbours separately as well, which is a bit #' suboptimal; you can file a feature request if this makes your data analysis #' tasks too slow. #' #' #' @seealso #' \code{\link{emst_mlpack}()} for a very fast alternative #' in the case of (very) low-dimensional Euclidean spaces (and \code{M} = 1). #' #' #' @references #' Jarník V., O jistém problému minimálním, #' Práce Moravské Přírodovědecké Společnosti 6 (1930) 57–63. #' #' Olson C.F., Parallel algorithms for hierarchical clustering, #' Parallel Comput. 21 (1995) 1313–1325. #' #' Prim R., Shortest connection networks and some generalisations, #' Bell Syst. Tech. J. 36 (1957) 1389–1401.
#' #' Campello R., Moulavi D., Zimek A., Sander J., #' Hierarchical density estimates for data clustering, visualization, #' and outlier detection, #' ACM Transactions on Knowledge Discovery from Data 10(1) (2015) 5:1–5:51. #' #' #' @param d either a numeric matrix (or an object coercible to one, #' e.g., a data frame with numeric-like columns) or an #' object of class \code{dist}, see \code{\link[stats]{dist}}. #' @param distance metric used to compute the linkage, one of: #' \code{"euclidean"} (synonym: \code{"l2"}), #' \code{"manhattan"} (a.k.a. \code{"l1"} and \code{"cityblock"}), #' \code{"cosine"}. #' @param M smoothing factor; \code{M} = 1 gives the selected \code{distance}; #' otherwise, the mutual reachability distance is used. #' @param verbose logical; whether to print diagnostic messages #' and progress information. #' @param cast_float32 logical; whether to compute the distances using 32-bit #' instead of 64-bit precision floating-point arithmetic (up to 2x faster). #' @param ... further arguments passed to or from other methods. #' #' @return #' Matrix of class \code{mst} with n-1 rows and 3 columns: #' \code{from}, \code{to} and \code{dist}. It holds \code{from} < \code{to}. #' Moreover, \code{dist} is sorted nondecreasingly. #' The i-th row gives the i-th edge of the MST. #' \code{(from[i], to[i])} defines the vertices (in 1,...,n) #' and \code{dist[i]} gives the weight, i.e., the #' distance between the corresponding points. #' #' The \code{method} attribute gives the name of the distance used. #' The \code{Labels} attribute gives the labels of all the input points. #' #' If \code{M} > 1, the \code{nn} attribute gives the indices of the \code{M}-1 #' nearest neighbours of each point. #' #' @examples #' library("datasets") #' data("iris") #' X <- iris[1:4] #' tree <- mst(X) #' #' @rdname mst #' @export mst <- function(d, ...) 
{ UseMethod("mst") } #' @export #' @rdname mst #' @method mst default mst.default <- function(d, distance=c("euclidean", "l2", "manhattan", "cityblock", "l1", "cosine"), M=1L, cast_float32=TRUE, verbose=FALSE, ...) { distance <- match.arg(distance) d <- as.matrix(d) result <- .mst.default(d, distance, M, cast_float32, verbose) attr(result, "method") <- if (M == 1L) distance else sprintf("mutual reachability distance (%s, M=%d)", distance, M) attr(result, "Labels") <- dimnames(d)[[1]] class(result) <- "mst" result } #' @export #' @rdname mst #' @method mst dist mst.dist <- function(d, M=1L, verbose=FALSE, ...) { result <- .mst.dist(d, M, verbose) attr(result, "method") <- if (M == 1L) attr(d, "method") else sprintf("mutual reachability distance (%s, M=%d)", attr(d, "method"), M) attr(result, "Labels") <- attr(d, "Labels") class(result) <- "mst" result } registerS3method("mst", "default", "mst.default") registerS3method("mst", "dist", "mst.dist") #' @title Euclidean Minimum Spanning Tree #' #' @description #' Provides access to the implementation of the Dual-Tree Borůvka #' algorithm from the \code{mlpack} package (if available). #' It is based on kd-trees and is fast for (very) low-dimensional #' Euclidean spaces. For higher dimensional spaces (say, over 5 features) #' or other metrics, use the parallelised Prim-like algorithm implemented #' in \code{\link{mst}()}. #' #' #' @param X a numeric matrix (or an object coercible to one, #' e.g., a data frame with numeric-like columns) #' @param leaf_size size of leaves in the kd-tree, #' controls the trade-off between speed and memory consumption #' @param naive logical; whether to use the naive, quadratic-time algorithm #' @param verbose logical; whether to print diagnostic messages #' #' @return #' An object of class \code{mst}, see \code{\link{mst}()} for details. #' #' @references #' March W.B., Ram P., Gray A.G., #' Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications, #' Proc. 
ACM SIGKDD'10 (2010) 603-611, \url{https://mlpack.org/papers/emst.pdf}. #' #' Curtin R.R., Edel M., Lozhnikov M., Mentekidis Y., Ghaisas S., Zhang S., #' mlpack 3: A fast, flexible machine learning library, #' Journal of Open Source Software 3(26), 726, 2018. #' #' @export emst_mlpack <- function(X, leaf_size=1, naive=FALSE, verbose=FALSE) { X <- as.matrix(X) if (!requireNamespace("mlpack", quietly=TRUE)) { warning("Package `mlpack` is not installed. Using mst() instead.") return(mst.default(X, verbose=verbose, cast_float32=FALSE)) } mst <- mlpack::emst(X, leaf_size=leaf_size, naive=naive, verbose=verbose)$output mst[, 1] <- mst[, 1] + 1 # 0-based -> 1-based indexing mst[, 2] <- mst[, 2] + 1 # 0-based -> 1-based indexing stopifnot(mst[, 1] < mst[, 2]) stopifnot(!is.unsorted(mst[, 3])) structure( mst, class="mst", method="euclidean", Labels=dimnames(X)[[1]] ) } genieclust/R/genieclust-package.R0000644000176200001440000000345214040202112016461 0ustar liggesusers# This file is part of the genieclust package for R. # ############################################################################ # # # # Copyleft (C) 2020-2021, Marek Gagolewski # # # # # # This program is free software: you can redistribute it and/or modify # # it under the terms of the GNU Affero General Public License # # Version 3, 19 November 2007, published by the Free Software Foundation. # # This program is distributed in the hope that it will be useful, # # but WITHOUT ANY WARRANTY; without even the implied warranty of # # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # # GNU Affero General Public License Version 3 for more details. # # You should have received a copy of the License along with this program. # # If this is not the case, refer to . # # # # ############################################################################ # #' @title The Genie++ Hierarchical Clustering Algorithm (with Extras) #' #' @description #' See \code{\link{genie}()} for more details. 
#' #' #' @name genieclust-package #' @rdname genieclust-package #' @aliases genieclust #' @docType package #' @author Marek Gagolewski #' #' @useDynLib genieclust, .registration=TRUE #' @importFrom Rcpp evalCpp #' @importFrom stats hclust #' @importFrom stats cutree #' @importFrom stats dist #' @importFrom utils capture.output invisible(NULL) genieclust/R/RcppExports.R0000644000176200001440000002124214040206351015225 0ustar liggesusers# Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #' @title Pairwise Partition Similarity Scores #' #' @description #' Let \code{x} and \code{y} represent two partitions of a set of \eqn{n} #' elements into, respectively, \eqn{K} and \eqn{L} #' nonempty and pairwise disjoint subsets. #' For instance, these can be two clusterings of a dataset with #' \eqn{n} observations specified by two vectors of labels. #' The functions described in this section quantify the similarity between #' \code{x} and \code{y}. They can be used as external cluster #' validity measures, i.e., in the presence of reference (ground-truth) #' partitions. #' #' @details #' Every index except \code{mi_score()} (which computes the mutual #' information score) outputs 1 given two identical partitions. #' Note that partitions are always defined up to a bijection of the set of #' possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4) #' represent the same 2-partition. #' #' \code{rand_score()} gives the Rand score (the "probability" of agreement #' between the two partitions) and #' \code{adjusted_rand_score()} is its version corrected for chance, #' see (Hubert, Arabie, 1985), #' its expected value is 0.0 given two independent partitions. #' Due to the adjustment, the resulting index might also be negative #' for some inputs. #' #' Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score #' and \code{adjusted_fm_score()} is its adjusted-for-chance version, #' see (Hubert, Arabie, 1985). 
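All of the similarity scores below are computed from the K×L confusion (contingency) matrix of the two label vectors, the analogue of R's `table(x, y)`. A C++ sketch of that reduction, mirroring the idea of the package's `Ccontingency_table` (the function name and the 0-based, contiguous-label assumption are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a K*L confusion (contingency) matrix from two label vectors:
// C[k][l] = number of indices i with x[i] == k and y[i] == l.
// Assumes equal-length vectors with 0-based, contiguous labels.
std::vector<std::vector<int>> confusion_matrix(const std::vector<int>& x,
                                               const std::vector<int>& y) {
    int K = *std::max_element(x.begin(), x.end()) + 1;
    int L = *std::max_element(y.begin(), y.end()) + 1;
    std::vector<std::vector<int>> C(K, std::vector<int>(L, 0));
    for (std::size_t i = 0; i < x.size(); ++i)
        C[x[i]][y[i]]++;  // one count per observation
    return C;
}
```

For x = (0, 0, 1, 1) and y = (0, 1, 1, 1) this yields the matrix {{1, 1}, {0, 2}}.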
#' #' Note that both the (unadjusted) Rand and FM scores are bounded from below #' by \eqn{1/(K+1)} if \eqn{K=L}, hence their adjusted versions are preferred. #' #' \code{mi_score()}, \code{adjusted_mi_score()} and #' \code{normalized_mi_score()} are information-theoretic #' scores, based on mutual information, #' see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}} #' in (Vinh et al., 2010). #' #' \code{normalized_accuracy()} is defined as #' \eqn{(Accuracy(C_\sigma)-1/L)/(1-1/L)}, where \eqn{C_\sigma} is a version #' of the confusion matrix for given \code{x} and \code{y}, #' \eqn{K \leq L}, with columns permuted based on the solution to the #' Maximal Linear Sum Assignment Problem. #' \eqn{Accuracy(C_\sigma)} is sometimes referred to as Purity, #' e.g., in (Rendon et al. 2011). #' #' \code{pair_sets_index()} gives the Pair Sets Index (PSI) #' adjusted for chance (Rezaei, Franti, 2016), \eqn{K \leq L}. #' Pairing is based on the solution to the Linear Sum Assignment Problem #' of a transformed version of the confusion matrix. #' #' @references #' Hubert L., Arabie P., Comparing Partitions, #' Journal of Classification 2(1), 1985, 193-218, esp. Eqs. (2) and (4). #' #' Rendon E., Abundez I., Arizmendi A., Quiroz E.M., #' Internal versus external cluster validation indexes, #' International Journal of Computers and Communications 5(1), 2011, 27-34. #' #' Rezaei M., Franti P., Set matching measures for external cluster validity, #' IEEE Transactions on Knowledge and Data Mining 28(8), 2016, 2173-2186. #' #' Vinh N.X., Epps J., Bailey J., #' Information theoretic measures for clusterings comparison: #' Variants, properties, normalization and correction for chance, #' Journal of Machine Learning Research 11, 2010, 2837-2854. 
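As a worked example of the adjustment for chance, the adjusted Rand score of Hubert and Arabie (1985, Eq. 2) can be computed directly from a confusion matrix. A hedged C++ sketch (illustrative names; double-valued counts for simplicity — not the package's implementation):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// (t choose 2), in the spirit of the package's Ccomb2() helper.
double comb2(double t) { return t * (t - 1.0) * 0.5; }

// Adjusted Rand index from a K*L confusion matrix, Hubert & Arabie (1985),
// Eq. (2): (sum_ij C(n_ij,2) - E) / (0.5*(sum_a + sum_b) - E), where
// sum_a, sum_b are the row/column-sum pair counts and
// E = sum_a * sum_b / C(n,2) is the expectation under independence.
double adjusted_rand(const std::vector<std::vector<double>>& C) {
    double n = 0.0, sum_ij = 0.0, sum_a = 0.0, sum_b = 0.0;
    std::vector<double> b(C[0].size(), 0.0);  // column sums
    for (const auto& row : C) {
        double a = 0.0;  // row sum
        for (std::size_t j = 0; j < row.size(); ++j) {
            sum_ij += comb2(row[j]);
            a += row[j];
            b[j] += row[j];
        }
        sum_a += comb2(a);
        n += a;
    }
    for (double bj : b) sum_b += comb2(bj);
    double e = sum_a * sum_b / comb2(n);
    return (sum_ij - e) / (0.5 * (sum_a + sum_b) - e);
}
```

Two identical 2-partitions of 4 points, C = {{2, 0}, {0, 2}}, give a score of exactly 1, while the maximally mixed C = {{1, 1}, {1, 1}} gives a negative score, as the adjustment permits.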
#' #' #' @param x an integer vector of length n (or an object coercible to one) #' representing a K-partition of an n-set, #' or a confusion matrix with K rows and L columns (see \code{table(x, y)}) #' #' @param y an integer vector of length n (or an object coercible to one) #' representing an L-partition of the same set, #' or NULL (if x is a K*L confusion matrix) #' #' @return A single real value giving the similarity score. #' #' @examples #' y_true <- iris[[5]] #' y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster #' adjusted_rand_score(y_true, y_pred) #' rand_score(table(y_true, y_pred)) # the same as rand_score(y_true, y_pred) #' adjusted_fm_score(y_true, y_pred) #' fm_score(y_true, y_pred) #' mi_score(y_true, y_pred) #' normalized_mi_score(y_true, y_pred) #' adjusted_mi_score(y_true, y_pred) #' normalized_accuracy(y_true, y_pred) #' pair_sets_index(y_true, y_pred) #' #' @rdname comparing_partitions #' @export adjusted_rand_score <- function(x, y = NULL) { .Call(`_genieclust_adjusted_rand_score`, x, y) } #' @rdname comparing_partitions #' @export rand_score <- function(x, y = NULL) { .Call(`_genieclust_rand_score`, x, y) } #' @rdname comparing_partitions #' @export adjusted_fm_score <- function(x, y = NULL) { .Call(`_genieclust_adjusted_fm_score`, x, y) } #' @rdname comparing_partitions #' @export fm_score <- function(x, y = NULL) { .Call(`_genieclust_fm_score`, x, y) } #' @rdname comparing_partitions #' @export mi_score <- function(x, y = NULL) { .Call(`_genieclust_mi_score`, x, y) } #' @rdname comparing_partitions #' @export normalized_mi_score <- function(x, y = NULL) { .Call(`_genieclust_normalized_mi_score`, x, y) } #' @rdname comparing_partitions #' @export adjusted_mi_score <- function(x, y = NULL) { .Call(`_genieclust_adjusted_mi_score`, x, y) } #' @rdname comparing_partitions #' @export normalized_accuracy <- function(x, y = NULL) { .Call(`_genieclust_normalized_accuracy`, x, y) } #' @rdname comparing_partitions #' @export pair_sets_index <- function(x, y = NULL) { 
.Call(`_genieclust_pair_sets_index`, x, y) } .mst.default <- function(X, distance = "euclidean", M = 1L, cast_float32 = TRUE, verbose = FALSE) { .Call(`_genieclust_dot_mst_default`, X, distance, M, cast_float32, verbose) } .mst.dist <- function(d, M = 1L, verbose = FALSE) { .Call(`_genieclust_dot_mst_dist`, d, M, verbose) } .genie <- function(mst, k, gini_threshold, postprocess, detect_noise, verbose) { .Call(`_genieclust_dot_genie`, mst, k, gini_threshold, postprocess, detect_noise, verbose) } .gclust <- function(mst, gini_threshold, verbose) { .Call(`_genieclust_dot_gclust`, mst, gini_threshold, verbose) } #' @title Inequity (Inequality) Measures #' #' @description #' \code{gini_index()} gives the normalised Gini index #' and \code{bonferroni_index()} implements the Bonferroni index. #' #' @details #' Both indices can be used to quantify the "inequity" of a numeric sample. #' They can be perceived as measures of data dispersion. #' For constant vectors (perfect equity), the indices yield values of 0. #' Vectors with all elements but one equal to 0 (perfect inequity) #' are assigned scores of 1. #' Both indices follow the Pigou-Dalton principle (are Schur-convex): #' setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0} #' and \eqn{x_i - h \geq x_j + h} (taking from the "rich" and #' giving to the "poor") decreases the inequity. #' #' These indices have applications in economics, amongst others. #' The Genie clustering algorithm uses the Gini index as a measure #' of the inequality of cluster sizes. #' #' #' The normalised Gini index is given by: #' \deqn{ #' G(x_1,\dots,x_n) = \frac{ #' \sum_{i=1}^{n-1} \sum_{j=i+1}^n |x_i-x_j| #' }{ #' (n-1) \sum_{i=1}^n x_i #' }. #' } #' #' The normalised Bonferroni index is given by: #' \deqn{ #' B(x_1,\dots,x_n) = \frac{ #' \sum_{i=1}^{n} (n-\sum_{j=1}^i \frac{n}{n-j+1}) #' x_{\sigma(n-i+1)} #' }{ #' (n-1) \sum_{i=1}^n x_i #' }. #' } #' #' #' Time complexity: \eqn{O(n)} for sorted (increasingly) data. 
#' Otherwise, the input vector will be sorted first. #' #' In particular, for ordered inputs, it holds: #' \deqn{ #' G(x_1,\dots,x_n) = \frac{ #' \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)} #' }{ #' (n-1) \sum_{i=1}^n x_i #' }, #' } #' where \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}. #' #' #' @references #' Bonferroni C., Elementi di Statistica Generale, Libreria Seber, #' Firenze, 1930. #' #' Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and #' outlier-resistant hierarchical clustering algorithm, #' Information Sciences 363, 2016, pp. 8-23. doi:10.1016/j.ins.2016.05.003 #' #' Gini C., Variabilita e Mutabilita, Tipografia di Paolo Cuppini, Bologna, 1912. #' #' #' @param x numeric vector of non-negative values #' #' @return The value of the inequity index, a number in \eqn{[0, 1]}. #' #' @examples #' gini_index(c(2, 2, 2, 2, 2)) # no inequality #' gini_index(c(0, 0, 10, 0, 0)) # one has it all #' gini_index(c(7, 0, 3, 0, 0)) # give to the poor, take away from the rich #' gini_index(c(6, 0, 3, 1, 0)) # (a.k.a. Pigou-Dalton principle)
#' bonferroni_index(c(2, 2, 2, 2, 2)) #' bonferroni_index(c(0, 0, 10, 0, 0)) #' bonferroni_index(c(7, 0, 3, 0, 0)) #' bonferroni_index(c(6, 0, 3, 1, 0)) #' #' @rdname inequity #' @export gini_index <- function(x) { .Call(`_genieclust_gini_index`, x) } #' @rdname inequity #' @export bonferroni_index <- function(x) { .Call(`_genieclust_bonferroni_index`, x) } genieclust/MD50000644000176200001440000000341214040221650012716 0ustar liggesusersd485051e578d28d621c6669e222df7e0 *DESCRIPTION a2ec20536b8a46381a58a41b9e3dc887 *NAMESPACE 3b3d121dd80915d30eda5d782a102fc0 *NEWS e6583443cde8a02d0649eebbce0ac431 *R/RcppExports.R 46834463b2f6c31032ed1a09c6227cc4 *R/gclust.R 220d82b2a4466a6d17dfdbfc028e447d *R/genieclust-package.R d9830bd42cfd875e2ae59b04455c71f1 *R/mst.R 4a42c6eb0a0b0cb7b3e03528671b2d54 *inst/CITATION 818455c165e7c24f90a349b83852ad8c *man/comparing_partitions.Rd 88c046d484091ec825fd8c17c1bfa2cb *man/emst_mlpack.Rd 64229d8430edf4398174863da12d7021 *man/gclust.Rd 2d42ddc36f3125513a5a39b621065f0a *man/genieclust-package.Rd 2485de8983b2c22ef4d39cc3cc079bb6 *man/inequity.Rd 51c06cf849aa9b556825d0757ec66d4e *man/mst.Rd f4da2b6208222f241f8d8e7db655e5a4 *src/Makevars f4da2b6208222f241f8d8e7db655e5a4 *src/Makevars.win 481dedec6f545f9263ca3ffaacaba67b *src/RcppExports.cpp c5ccebf67601216b3807c0eda5ef4b7a *src/c_argfuns.h 9d739b8825921b60a39faa02b746fe9a *src/c_common.h 1654de23149e3fe497cd55e62215c880 *src/c_compare_partitions.h cdee902233e4998d0f30311a1c73b192 *src/c_disjoint_sets.h 88ad7ff513ec5edba336ab2d05016520 *src/c_distance.h e1409f6616641327dfb3ba00c4defc74 *src/c_genie.h 5e6f3adcee89d5c1d5ef7fd46608cccd *src/c_gini_disjoint_sets.h 66a703758a46f0592fcefdac7d0a704d *src/c_inequity.h 12b2fb014abdb9b63e9c6ceae2baeb2a *src/c_int_dict.h 54b7ccf3e84dd8ede0aee306da4f7000 *src/c_matrix.h 7ab7b01de08f3a44aca166910ab6ec48 *src/c_mst.h 10cfe5caa3f462f2a52433dc94918cf9 *src/c_postprocess.h 57dcced933e8e2c4e5459084c06d0334 *src/c_preprocess.h 
c0ee19c026d36038a354f89843f44095 *src/c_scipy_rectangular_lsap.h 310f978269e43f255bbd6c2cc8e4411c *src/r_compare_partitions.cpp 389742a4b5fac0887d77bd6006fffb91 *src/r_emst.cpp aaf9e40a04c4afffa663890c18196c91 *src/r_gclust.cpp c8f6c61130c759df1060ed2d869eb70e *src/r_inequity.cpp genieclust/inst/0000755000176200001440000000000014040176710013371 5ustar liggesusersgenieclust/inst/CITATION0000644000176200001440000000145614040176710014534 0ustar liggesusersbibentry( bibtype = "manual", title = "genieclust: Fast and robust hierarchical clustering with noise point detection", author = personList( as.person("Marek Gagolewski") ), year = sub("-.*", "", meta$Date), note = sprintf("R package version %s", meta$Version), url = "https://genieclust.gagolewski.com" ) bibentry( bibtype = "article", title = "Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm", author = personList( as.person("Marek Gagolewski"), as.person("Maciej Bartoszuk"), as.person("Anna Cena") ), journal = "Information Sciences", year = "2016", volume = "363", pages = "8--23", doi = "10.1016/j.ins.2016.05.003" )
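The formulas documented above are simple enough to re-derive in a few lines of plain R as a sanity check. The snippet below is an illustrative, unoptimised sketch following the \deqn{} definitions from the Rd text and the standard pair-counting formula of Hubert and Arabie (1985); the helper names (`gini_index_ref()`, `bonferroni_index_ref()`, `adjusted_rand_score_ref()`) are hypothetical and are not part of the genieclust API, whose exported functions call compiled C++ code instead.

```r
# Plain-R reference sketch of the documented formulas (illustration only;
# genieclust's exported functions use compiled C++ implementations).

# Normalised Gini index: sum of pairwise absolute differences,
# scaled by (n-1) * sum(x); O(n^2), unlike the package's O(n) sorted-data code.
gini_index_ref <- function(x) {
    n <- length(x)
    sum(abs(outer(x, x, "-"))) / 2 / ((n - 1) * sum(x))
}

# Normalised Bonferroni index: weighted sum over the order statistics,
# where x_{sigma(n-i+1)} denotes the i-th largest value.
bonferroni_index_ref <- function(x) {
    n <- length(x)
    xs <- sort(x, decreasing = TRUE)
    coef <- n - cumsum(n / (n - seq_len(n) + 1))  # n - sum_{j=1}^{i} n/(n-j+1)
    sum(coef * xs) / ((n - 1) * sum(x))
}

# Adjusted Rand score via pair counting on the confusion matrix,
# cf. Hubert & Arabie (1985), Eq. (4).
adjusted_rand_score_ref <- function(x, y) {
    tab <- table(x, y)
    pairs <- function(v) sum(choose(v, 2))
    a <- pairs(tab)            # pairs placed together in both partitions
    b <- pairs(rowSums(tab))   # pairs together in the first partition
    d <- pairs(colSums(tab))   # pairs together in the second partition
    expected <- b * d / choose(sum(tab), 2)
    (a - expected) / ((b + d) / 2 - expected)
}

gini_index_ref(c(2, 2, 2, 2, 2))    # perfect equity: 0
gini_index_ref(c(0, 0, 10, 0, 0))   # perfect inequity: 1
adjusted_rand_score_ref(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same 2-partition: 1
```

For the iris example in the documentation above, these reference versions should agree with `gini_index()`, `bonferroni_index()` and `adjusted_rand_score()` up to floating-point error.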