genie/0000755000176200001440000000000013711633542011346 5ustar liggesusersgenie/NAMESPACE0000644000176200001440000000032013711501666012561 0ustar liggesusers# Generated by roxygen2: do not edit by hand export(hclust2) importFrom(Rcpp,evalCpp) importFrom(genieclust,gclust) importFrom(genieclust,genie) importFrom(stats,approx) useDynLib(genie, .registration=TRUE) genie/man/0000755000176200001440000000000013431502657012122 5ustar liggesusersgenie/man/genie-package.Rd0000644000176200001440000000046513706445240015075 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/genie-package.R \docType{package} \name{genie-package} \alias{genie-package} \alias{genie} \title{The Genie Package} \description{ See \code{\link{hclust2}()} for details. } \author{ Marek Gagolewski, Maciej Bartoszuk, Anna Cena } genie/man/hclust2.Rd0000644000176200001440000001164013711501575013776 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/hclust2.R \name{hclust2} \alias{hclust2} \title{Fast Hierarchical Clustering in Spaces Equipped With a Dissimilarity Measure} \usage{ hclust2(d = NULL, objects = NULL, thresholdGini = 0.3, useVpTree = FALSE, ...) } \arguments{ \item{d}{an object of class \code{\link[stats]{dist}}, \code{NULL}, or a single string, see below} \item{objects}{\code{NULL}, numeric matrix, a list, or a character vector} \item{thresholdGini}{single numeric value in [0,1], threshold for the Gini index, 1 gives the standard single linkage algorithm} \item{useVpTree}{single logical value, whether to use a vantage-point tree to speed up nearest neighbour searching in low-dimensional spaces} \item{...}{internal parameters used to tune up the algorithm} } \value{ A named list of class \code{hclust}, see \code{\link[stats]{hclust}}, with additional components: \itemize{ \item \code{stats} - performance statistics \item \code{control} - internal parameters used } } \description{ The reference implementation of the fast, robust and outlier resistant Genie algorithm described in (Gagolewski, Bartoszuk, Cena, 2016). Note that the \code{genie} package has been superseded by \code{genieclust}, see \code{\link[genieclust]{gclust}} and \code{\link[genieclust]{genie}} for more details. } \details{ The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algorithm is very sensitive to outliers, produces highly skewed dendrograms, and therefore usually does not reflect the true underlying data structure -- unless the clusters are well-separated. To overcome its limitations, in (Gagolewski, Bartoszuk, Cena, 2016) we proposed a new hierarchical clustering linkage criterion. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (here, the Gini index) of the cluster sizes does not increase drastically above a given threshold. The benchmarks indicate a high practical usefulness of the introduced method: it most often outperforms the Ward or average linkage in terms of the clustering quality while retaining the single linkage speed. The algorithm can be run in parallel (via OpenMP) on multiple threads to speed up its execution further on. Its memory overhead is small: there is no need to precompute the complete distance matrix to perform the computations in order to obtain a desired clustering. For compatibility with \code{\link[stats]{hclust}}, \code{d} may be an object of class \code{\link[stats]{dist}}. In such a case, the \code{objects} argument is ignored. Note that such an object requires ca. \emph{8n(n-1)/2} bytes of computer's memory, where \emph{n} is the number of objects to cluster, and therefore this setting can be used to analyse data sets of sizes up to about 10,000-50,000. If \code{objects} is a character vector or a list, then \code{d} should be a single string, one of: \code{levenshtein} (or \code{NULL}), \code{hamming}, \code{dinu} (Dinu, Sgarro, 2006), or \code{euclinf} (Cena et al., 2015). Note that the list must consist either of integer or of numeric vectors only (depending on the dissimilarity measure of choice). On the other hand, each string must be in ASCII, but you can always convert it to UTF-32 with \code{\link[stringi]{stri_enc_toutf32}}. Otherwise, if \code{objects} is a numeric matrix (here, each row denotes a distinct observation), then \code{d} should be a single string, one of: \code{euclidean_squared} (or \code{NULL}), \code{euclidean} (which yields the same results as \code{euclidean_squared}) \code{manhattan}, \code{maximum}, or \code{hamming}. If \code{useVpTree} is \code{FALSE}, then the dissimilarity measure of choice is guaranteed to be computed for each unique pair of \code{objects} only once. } \examples{ library("datasets") data("iris") h <- hclust2(objects=as.matrix(iris[,2:3]), thresholdGini=0.2) plot(iris[,2], iris[,3], col=cutree(h, 3), pch=as.integer(iris[,5]), asp=1, las=1) } \references{ Cena A., Gagolewski M., Mesiar R., Problems and challenges of information resources producers' clustering, \emph{Journal of Informetrics} 9(2), 2015, pp. 273-284. Dinu L.P., Sgarro A., A Low-complexity Distance for DNA Strings, \emph{Fundamenta Informaticae} 73(3), 2006, pp. 361-372. Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, \emph{Information Sciences} 363, 2016, pp. 8-23. Gagolewski M., Cena A., Bartoszuk M. \emph{Hierarchical clustering via penalty-based aggregation and the Genie approach}, In: Torra V. et al. (Eds.), \emph{Modeling Decisions for Artificial Intelligence} (\emph{Lecture Notes in Artificial Intelligence} 9880), Springer, 2016. } genie/DESCRIPTION0000644000176200001440000000402313711633542013053 0ustar liggesusersPackage: genie Type: Package Title: Fast, Robust, and Outlier Resistant Hierarchical Clustering Version: 1.0.5 Date: 2020-08-02 Authors@R: c( person("Marek", "Gagolewski", role = c("aut", "cre", "cph"), email = "marek@gagolewski.com", comment = c(ORCID = "0000-0003-0637-6028")), person("Maciej", "Bartoszuk", role = "aut", email = "bartoszuk@rexamine.com", comment = c(ORCID = "0000-0001-6088-8273")), person("Anna", "Cena", role = "aut", email = "cena@rexamine.com", comment = c(ORCID = "0000-0001-8697-5383")) ) Description: Includes the reference implementation of Genie - a hierarchical clustering algorithm that links two point groups in such a way that an inequity measure (namely, the Gini index) of the cluster sizes does not significantly increase above a given threshold. This method most often outperforms many other data segmentation approaches in terms of clustering quality as tested on a wide range of benchmark datasets. At the same time, Genie retains the high speed of the single linkage approach, therefore it is also suitable for analysing larger data sets. For more details see (Gagolewski et al. 2016 ). For an even faster and more feature-rich implementation, including, amongst others, noise point detection, see the 'genieclust' package. License: GPL (>= 3) BugReports: http://github.com/gagolews/genie/issues URL: http://genieclust.gagolewski.com/ Depends: R (>= 3.3.0), stats, genieclust Imports: Rcpp (>= 1.0.0) Suggests: datasets, testthat, stringi LinkingTo: Rcpp (>= 1.0.0) SystemRequirements: OpenMP, C++11 RoxygenNote: 7.1.1 NeedsCompilation: yes Packaged: 2020-08-02 09:13:53 UTC; gagolews Author: Marek Gagolewski [aut, cre, cph] (), Maciej Bartoszuk [aut] (), Anna Cena [aut] () Maintainer: Marek Gagolewski Repository: CRAN Date/Publication: 2020-08-02 22:00:02 UTC genie/tests/0000755000176200001440000000000013241042746012506 5ustar liggesusersgenie/tests/testthat/0000755000176200001440000000000013711633542014350 5ustar liggesusersgenie/tests/testthat/test-single.R0000644000176200001440000000137613241042746016736 0ustar liggesuserslibrary("testthat") library("genie") library("stats") context("hclust2 vs single linkage") test_that("single_iris_distmat", { library("datasets") data("iris") d <- as.matrix(iris[,1:4]) d[,] <- jitter(d) # otherwise we get a non-unique solution d <- dist(d) h1 <- hclust2(d, thresholdGini=1.0) h2 <- hclust(d, method='single') expect_equal(h1$merge, h2$merge) expect_equal(h1$order, h2$order) }) test_that("single_iris_defaultdist", { library("datasets") data("iris") d <- as.matrix(iris[,2:3]) d[,] <- jitter(d) # otherwise we get a non-unique solution h1 <- hclust2(objects=d, thresholdGini=1.0) h2 <- hclust(dist(d), method='single') expect_equal(h1$merge, h2$merge) expect_equal(h1$order, h2$order) }) genie/tests/testthat.R0000644000176200001440000000007113241042746014467 0ustar liggesuserslibrary("testthat") library("genie") test_check("genie") genie/src/0000755000176200001440000000000013711501721012126 5ustar liggesusersgenie/src/hclust2_common.h0000644000176200001440000004350613324042165015245 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #ifndef __HCLUST2_COMMON_H #define __HCLUST2_COMMON_H #include "defs.h" #include "hclust2_distance.h" #include "disjoint_sets.h" #include #include #include #include #include namespace grup { template struct Matrix { size_t nrow; size_t ncol; T* data; Matrix() : nrow(0), ncol(0), data(NULL) {} Matrix(size_t nrow, size_t ncol) : nrow(nrow), ncol(ncol) { data = new T[nrow*ncol]; } ~Matrix() { if (data) delete [] data; } Matrix(const Matrix& m) { nrow = m.nrow; ncol = m.ncol; data = new T[nrow*ncol]; for (size_t i=0; i= o2.dist; } }; typedef std::priority_queue, HeapNeighborItemFromSmallestComparator> priority_queue_HeapNeighborItem_FromSmallest; struct HeapHierarchicalItem { size_t index1; size_t index2; double dist; HeapHierarchicalItem() : index1(SIZE_MAX), index2(SIZE_MAX), dist(INFINITY) {} HeapHierarchicalItem(size_t index1, size_t index2, double dist) : index1(index1), index2(index2), dist(dist) {} inline bool operator<( const HeapHierarchicalItem& o ) const { return dist >= o.dist || (dist == o.dist && index2 > o.index2); // SIZE_MAX index2 at the end of a series } }; // class HclustPriorityQueue { // private: // struct BSTNode { // BSTNode* left; // BSTNode* right; // HeapHierarchicalItem elem; // }; // // BSTNode* root; // // void rotateLeft(BSTNode** root) { // STOPIFNOT(*root) // if (!(*root)->left) return; // BSTNode* oldroot = *root; // *root = oldroot->left; // oldroot->left = (*root)->right; // (*root)->right = oldroot; // } // // void rotateRight(BSTNode** root) { // STOPIFNOT(*root) // if (!(*root)->right) return; // BSTNode* oldroot = *root; // *root = oldroot->right; // oldroot->right = (*root)->left; // (*root)->left = oldroot; // } // // void deleteSubTree(BSTNode** root) { // if (!*root) return; // deleteSubTree(&(*root)->left); // deleteSubTree(&(*root)->right); // delete *root; // *root = NULL; // } // // void deleteLeftmost(BSTNode** root) { // STOPIFNOT(*root) // if ((*root)->left) { // deleteLeftmost(&(*root)->left); // } // else { // BSTNode* delme = *root; // *root = delme->right; // delete delme; // } // } // // void insert(BSTNode** root, const HeapHierarchicalItem& data) { // if (*root) { // if (data.dist < (*root)->elem.dist) // insert(&(*root)->left, data); // else // insert(&(*root)->right, data); // // double u = unif_rand(); // if (u < 0.33) rotateLeft(root); // else if (u < 0.67) rotateRight(root); // } // else { // *root = new BSTNode; // (*root)->left = NULL; // (*root)->right = NULL; // (*root)->elem = data; // } // } // // // public: // HclustPriorityQueue(std::size_t) { root = NULL; } // const bool empty() const { return root == NULL; } // ~HclustPriorityQueue() { deleteSubTree(&root); } // // const HeapHierarchicalItem& top() { // STOPIFNOT(root) // while (root->left) // rotateLeft(&root); // return root->elem; // } // // void pop() { // STOPIFNOT(root) // deleteLeftmost(&root); // } // // void push(const HeapHierarchicalItem& data) { // insert(&root, data); // } // // }; // class HclustPriorityQueue { // private: // // std::vector left; // std::vector right; // std::vector parent; // std::vector elem; // std::vector free; // std::size_t occupied; // std::size_t root; // std::size_t best; // // double check_sorted; // // void print(std::size_t cur, std::size_t h) { // if (cur == SIZE_MAX) return; // print(left[cur], h+1); // std::cerr << elem[cur].dist << "(" << h << "), "; // print(right[cur], h+1); // } // // void checkSorted(std::size_t cur) { // if (cur == SIZE_MAX) return; // checkSorted(right[cur]); // STOPIFNOT(check_sorted >= elem[cur].dist) // check_sorted = elem[cur].dist; // checkSorted(left[cur]); // } // // public: // // HclustPriorityQueue(std::size_t n) : // left(n), right(n), parent(n), elem(n), free(n) // { // root = SIZE_MAX; // occupied = 0; // best = SIZE_MAX; // for (std::size_t i=0; i= 0) // if (parent[best] == SIZE_MAX) { // // it's a root // best = root = right[best]; // parent[root] = SIZE_MAX; // if (best != SIZE_MAX) { // while (left[best] != SIZE_MAX) // best = left[best]; // } // } // else { // parent[best] != SIZE_MAX // STOPIFNOT(left[parent[best]] == best) // STOPIFNOT(elem[parent[best]].dist >= elem[best].dist) // STOPIFNOT(elem[root].dist >= elem[best].dist) // if (right[best] == SIZE_MAX) { // left[parent[best]] = SIZE_MAX; // best = parent[best]; // } // else { // right[best] != SIZE_MAX // left[parent[best]] = right[best]; // parent[right[best]] = parent[best]; // best = right[best]; // while (left[best] != SIZE_MAX) // best = left[best]; // } // } // } // // inline const HeapHierarchicalItem& top() { // STOPIFNOT(best != SIZE_MAX) // return elem[best]; // } // // void push(const HeapHierarchicalItem& data) { // STOPIFNOT(occupied+1 < free.size()) // if (root == SIZE_MAX) { // STOPIFNOT(occupied == 0) // root = best = free[occupied++]; // right[root] = left[root] = parent[root] = SIZE_MAX; // elem[root] = data; // return; // } // // STOPIFNOT(best != SIZE_MAX) // if (data.dist < elem[best].dist) { // STOPIFNOT(left[best] == SIZE_MAX) // left[best] = free[occupied++]; // parent[left[best]] = best; // best = left[best]; // right[best] = left[best] = SIZE_MAX; // elem[best] = data; // return; // } // // // std::size_t start = root; // while (true) { // if (data.dist < elem[start].dist) { // if (left[start] == SIZE_MAX) { // left[start] = free[occupied++]; // parent[left[start]] = start; // left[left[start]] = right[left[start]] = SIZE_MAX; // elem[left[start]] = data; // return; // } // else { // start = left[start]; // } // } // else { // if (right[start] == SIZE_MAX) { // right[start] = free[occupied++]; // parent[right[start]] = start; // left[right[start]] = right[right[start]] = SIZE_MAX; // elem[right[start]] = data; // return; // } // else { // start = right[start]; // } // } // } // } // // inline bool empty() const { return root == SIZE_MAX; } // }; class HclustPriorityQueue { size_t n; size_t ncur; std::vector items; bool heapMade; public: HclustPriorityQueue(size_t n=0) : n(n), ncur(0), items(n), heapMade(false) { } const HeapHierarchicalItem& top() { if (!heapMade) { std::make_heap(items.begin(), items.begin()+ncur); heapMade = true; } return items[0]; } void pop() { if (!heapMade) { std::make_heap(items.begin(), items.begin()+ncur); heapMade = true; } std::pop_heap(items.begin(), items.begin()+ncur); --ncur; STOPIFNOT(ncur >= 0); } void push(const HeapHierarchicalItem& item) { STOPIFNOT(ncur < n); items[ncur++] = item; if (heapMade) { std::push_heap(items.begin(), items.begin()+ncur); } } bool empty() const { return (ncur == 0); } void reset() { heapMade = false; } }; struct HClustOptions { // size_t degree; // for GNAT // size_t candidatesTimes; // for GNAT // size_t minDegree; // for GNAT // size_t maxDegree; // for GNAT // size_t maxTimesDegree; // for GNAT size_t maxLeavesElems; // size_t maxNNPrefetch; // size_t maxNNMerge; // size_t minNNPrefetch; // size_t minNNMerge; // // std::string exemplar; // bool useVpTree; bool useMST; size_t vpSelectScheme; // vp-tree and GNAT size_t vpSelectCand; // for vpSelectScheme == 1 size_t vpSelectTest; // for vpSelectScheme == 1 size_t nodesVisitedLimit;// for single approx double thresholdGini; // for single approx // size_t exemplarUpdateMethod; // exemplar - naive(0) or not naive(1)? // size_t maxExemplarLeavesElems; //for exemplars biggers numbers are needed I think // bool isCurseOfDimensionality; HClustOptions(Rcpp::RObject control); Rcpp::NumericVector toR() const; }; struct HClustStats { size_t nodeCount; // how many nodes are there in the tree size_t leafCount; // how many leaves size_t nodeVisit; // how many nodes were visited during NN search size_t nnCals; // how many times NN search job was launched size_t nnCount; // how many NNs were obtained in overall size_t medoidOldNew; //..how many times it was successful size_t medoidUpdateCount; // how many times we calculate d_old and d_new.. HClustStats(); ~HClustStats(); Rcpp::NumericVector toR() const; }; struct NNHeap { std::priority_queue< HeapNeighborItem > heap; static HClustOptions* opts; size_t exemplarsCount; // #ifdef _OPENMP // omp_lock_t lock; // #endif NNHeap() : heap(), exemplarsCount(0) { // #ifdef _OPENMP // omp_init_lock(&lock); // #endif } static void setOptions(HClustOptions* newopts) { opts = newopts; } ~NNHeap() { // #ifdef _OPENMP // omp_destroy_lock(&lock); // #endif } inline bool empty() { return heap.empty(); } inline const HeapNeighborItem& top() { return heap.top(); } inline const size_t size() { return heap.size(); } inline void pop() { heap.pop(); } inline void push(const HeapNeighborItem& elem) { heap.push(elem); } inline void insert(double index, double dist, double& maxR) { STOPIFNOT(NNHeap::opts != NULL) // #ifdef _OPENMP // omp_set_lock(&lock); // #endif if (heap.size() >= opts->maxNNPrefetch && dist < maxR) { while (!heap.empty() && heap.top().dist == maxR) { heap.pop(); } } heap.push( HeapNeighborItem(index, dist) ); if (heap.size() >= opts->maxNNPrefetch) maxR = heap.top().dist; // #ifdef _OPENMP // omp_unset_lock(&lock); // #endif } inline void insertExemplars(double index, double dist, double& maxR, DisjointSets& ds, bool isExemplar) { // #ifdef _OPENMP // omp_set_lock(&lock); // #endif heap.push( HeapNeighborItem(index, dist) ); if(isExemplar) { exemplarsCount++; } std::list toRemove; size_t toRemoveExemplarsCount=0; if (heap.size() >= opts->maxNNPrefetch+1 && dist < maxR) { while (!heap.empty() && heap.top().dist == maxR) { toRemove.push_back(heap.top()); if(heap.top().index == ds.find_set(heap.top().index)) { toRemoveExemplarsCount++; } heap.pop(); } } if(toRemoveExemplarsCount == exemplarsCount && exemplarsCount > 0) { for(auto it = toRemove.begin(); it != toRemove.end(); ++it) heap.push(*it); } else { exemplarsCount -= toRemoveExemplarsCount; } if (heap.size() >= opts->maxNNPrefetch && exemplarsCount > 0) maxR = heap.top().dist; // #ifdef _OPENMP // omp_unset_lock(&lock); // #endif } inline void fill(std::deque& nearestNeighbors) { while (!heap.empty()) { nearestNeighbors.push_front(heap.top()); heap.pop(); } } inline void fill(std::list& nearestNeighbors) { while (!heap.empty()) { nearestNeighbors.push_front(heap.top()); heap.pop(); } } inline void fill(std::priority_queue, HeapNeighborItemFromSmallestComparator>& nearestNeighbors) { while (!heap.empty()) { nearestNeighbors.push(heap.top()); heap.pop(); } } }; struct DistanceComparator { size_t index; Distance* distance; DistanceComparator(size_t index, Distance* distance) : index(index), distance(distance) {} inline bool operator()(size_t a, size_t b) { return (*distance)( index, a ) < (*distance)( index, b ); } }; struct DistanceComparatorCached { std::vector* distances; DistanceComparatorCached(std::vector* distances) : distances(distances) {} inline bool operator()(size_t a, size_t b) { return (*distances)[a] < (*distances)[b]; } }; struct IndexComparator { size_t index; IndexComparator(size_t index) : index(index) {} inline bool operator()(size_t a) { return (a <= index); } }; inline bool comparer_gt(size_t i, size_t j) { return (i>j); } inline bool comparer_gt(double i, double j) { return (i>j); } struct SortedPoint { size_t i; size_t j; SortedPoint() :i(0),j(0) {} SortedPoint(size_t _i, size_t _j) { if(_j < _i) { i = _j; j = _i; } else { i = _i; j = _j; } } inline bool operator==(const SortedPoint &other) const { return (i == other.i && j == other.j); } }; struct Point { size_t i; size_t j; Point() : i(0),j(0) {} Point(size_t _i, size_t _j) { i = _i; j = _j; } inline bool operator==(const Point &other) const { return (i == other.i && j == other.j); } }; } // namespace grup // #include namespace std { template <> struct hash { /* * template void hash_combine(size_t & seed, T const& v); * seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2); */ std::size_t operator()(const grup::Point& k) const { std::size_t seed = 0; seed ^= (size_t)(k.i) + 0x9e3779b9 + (seed << 6) + (seed >> 2); seed ^= (size_t)(k.j) + 0x9e3779b9 + (seed << 6) + (seed >> 2); // boost::hash_combine(seed, k.i); // boost::hash_combine(seed, k.j); return seed; } }; template <> struct hash { std::size_t operator()(const grup::SortedPoint& k) const { std::size_t seed = 0; seed ^= (size_t)(k.i) + 0x9e3779b9 + (seed << 6) + (seed >> 2); seed ^= (size_t)(k.j) + 0x9e3779b9 + (seed << 6) + (seed >> 2); // boost::hash_combine(seed, k.i); // boost::hash_combine(seed, k.j); return seed; } }; } // namespace std #endif genie/src/hclust2_vptree_single.cpp0000644000176200001440000003022713324042165017152 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #include "hclust2_vptree_single.h" using namespace grup; // constructor (OK, we all know what this is, but I label it for faster in-code search) HClustVpTreeSingle::HClustVpTreeSingle(Distance* dist, HClustOptions* opts) : HClustNNbasedSingle(dist, opts), root(NULL) // visitAll(false) { MESSAGE_2("[%010.3f] building vp-tree\n", clock()/(float)CLOCKS_PER_SEC); std::vector distances(n); root = buildFromPoints(0, n, distances); } HClustVpTreeSingle::~HClustVpTreeSingle() { // MESSAGE_2("[%010.3f] destroying vp-tree\n", clock()/(float)CLOCKS_PER_SEC); if(root) delete root; } size_t HClustVpTreeSingle::chooseNewVantagePoint(size_t left, size_t right) { // if (opts->vpSelectScheme == 1) { // // idea by Yianilos (original vp-tree paper) // if (left + opts->vpSelectCand + opts->vpSelectTest > right) // return left; // // // randomize: // for (size_t i=left; ivpSelectCand+opts->vpSelectTest; ++i) // std::swap(indices[i], indices[i+(size_t)(unif_rand()*(right-i))]); // // // maximize variance // size_t bestIndex = -1; // double bestSigma = -INFINITY; // for (size_t i=left; ivpSelectCand; i++) { // accumulators::accumulator_set< double, // accumulators::features > acc; // for (size_t j = left+opts->vpSelectCand; j < left+opts->vpSelectCand+opts->vpSelectTest; ++j) // acc( (*distance)( indices[i], indices[j] ) ); // double curSigma = accumulators::variance(acc); // if (curSigma > bestSigma) { // bestSigma = curSigma; // bestIndex = i; // } // } // // return bestIndex; // } // else if (opts->vpSelectScheme == 2) { // idea by T. Bozkaya and M. Ozsoyoglu, "Indexing large metric spaces // for similarity search queries" // randomize: std::swap(indices[left], indices[left+(size_t)(unif_rand()*(right-left))]); // which one maximizes dist to indices[left]? size_t bestIndex = left; double bestDist = 0.0; for (size_t i=left+1; i bestDist) { bestDist = curDist; bestIndex = i; } } // for (size_t i=left+2; i bestDist) { // bestDist = curDist; // bestIndex = i; // } // } // for (size_t i=left+3; i bestDist) { // bestDist = curDist; // bestIndex = i; // } // } return bestIndex; } else { // return random index // don'use left one (even if sample seems to be randomized already, // vp in subtrees is already on the left...) return left+(size_t)(unif_rand()*(right-left)); } } HClustVpTreeSingleNode* HClustVpTreeSingle::buildFromPoints(size_t left, size_t right, std::vector& distances) { #ifdef GENERATE_STATS ++stats.nodeCount; #endif if (right - left <= opts->maxLeavesElems) { #ifdef GENERATE_STATS ++stats.leafCount; #endif HClustVpTreeSingleNode* leaf = new HClustVpTreeSingleNode(left, right); leaf->maxindex = right-1; // left < right-1 return leaf; } size_t vpi_idx = chooseNewVantagePoint(left, right); std::swap(indices[left], indices[vpi_idx]); size_t vpi = indices[left]; size_t median = (right + left) / 2; for (size_t i=left+1; i 1 time // std::nth_element(indices.begin() + left + 1, indices.begin() + median, indices.begin() + right, // DistanceComparator(vpi, distance)); // HClustVpTreeSingleNode* node = new HClustVpTreeSingleNode(vpi, left, left+1, (*distance)(vpi, indices[median])); HClustVpTreeSingleNode* node = new HClustVpTreeSingleNode(vpi, left, left+1, distances[indices[median]]); node->maxindex = left; if (median - left > 0) { // don't include vpi node->childL = buildFromPoints(left+1, median+1, distances); if (node->childL->maxindex > node->maxindex) node->maxindex = node->childL->maxindex; } if (right - median - 1 > 0) { node->childR = buildFromPoints(median+1, right, distances); if (node->childR->maxindex > node->maxindex) node->maxindex = node->childR->maxindex; } return node; } void HClustVpTreeSingle::getNearestNeighborsFromMinRadiusRecursiveLeaf( HClustVpTreeSingleNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap) { STOPIFNOT(node->vpindex == SIZE_MAX); if (!prefetch && !node->sameCluster) { size_t commonCluster = ds.find_set(node->left); for (size_t i=node->left; iright; ++i) { size_t currentCluster = ds.find_set(i); if (currentCluster != commonCluster) commonCluster = SIZE_MAX; if (currentCluster == clusterIndex) continue; if (index >= i) continue; double dist2 = (*distance)(indices[index], indices[i]); // the slow part if (dist2 > maxR || dist2 <= minR) continue; if (dist2 < bestR.top()) { bestR.pop(); bestR.push(dist2); } nnheap.insert(i, dist2, maxR); } if (commonCluster != SIZE_MAX) node->sameCluster = true; // set to true (btw, may be true already) } else /* node->sameCluster */ { for (size_t i=node->left; iright; ++i) { if (index >= i) continue; double dist2 = (*distance)(indices[index], indices[i]); // the slow part if (dist2 > maxR || dist2 <= minR) continue; if (dist2 < bestR.top()) { bestR.pop(); bestR.push(dist2); } nnheap.insert(i, dist2, maxR); } } } void HClustVpTreeSingle::getNearestNeighborsFromMinRadiusRecursiveNonLeaf( HClustVpTreeSingleNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap) { STOPIFNOT(node->vpindex != SIZE_MAX); // first visit the vantage point double dist = (*distance)(indices[index], indices[node->left]); // the slow part if (index < node->left && dist <= maxR && dist > minR && ds.find_set(node->left) != clusterIndex) { if (dist < bestR.top()) { bestR.pop(); bestR.push(dist); } nnheap.insert(node->left, dist, maxR); } // if (visitAll) { // if (node->childL && index < node->childL->maxindex) // getNearestNeighborsFromMinRadiusRecursive(node->childL, index, clusterIndex, minR, maxR, nnheap); // if (node->childR && index < node->childR->maxindex) // getNearestNeighborsFromMinRadiusRecursive(node->childR, index, clusterIndex, minR, maxR, nnheap); // } // else { if (dist < node->radius) { if (node->childL && index < node->childL->maxindex && dist + node->radius > minR) { // double cutR = dist - node->radius; //STOPIFNOT(maxR >= cutR); //STOPIFNOT(!(bestR.top() < cutR)); getNearestNeighborsFromMinRadiusRecursive(node->childL, index, clusterIndex, minR, bestR, maxR, nnheap); } if (node->childR && index < node->childR->maxindex) { double cutR = node->radius - dist; if (maxR >= cutR) { if (bestR.top() < cutR) { while (!nnheap.empty() && nnheap.top().dist > cutR) { nnheap.pop(); } maxR = cutR; } else getNearestNeighborsFromMinRadiusRecursive(node->childR, index, clusterIndex, minR, bestR, maxR, nnheap); } } } else /* ( dist >= node->radius ) */ { if (node->childR && index < node->childR->maxindex) { // double cutR = node->radius - dist; //STOPIFNOT(maxR >= cutR); //STOPIFNOT(!(bestR.top() < cutR)); getNearestNeighborsFromMinRadiusRecursive(node->childR, index, clusterIndex, minR, bestR, maxR, nnheap); } if (node->childL && index < node->childL->maxindex && dist + node->radius > minR) { double cutR = dist - node->radius; if (maxR >= cutR) { if (bestR.top() < cutR) { while (!nnheap.empty() && nnheap.top().dist > cutR) { nnheap.pop(); } maxR = cutR; } else getNearestNeighborsFromMinRadiusRecursive(node->childL, index, clusterIndex, minR, bestR, maxR, nnheap); } } } // } updateSameClusterFlag(node); } void HClustVpTreeSingle::updateSameClusterFlag(HClustVpTreeSingleNode* node) { if (prefetch || node->sameCluster || (node->childL && !node->childL->sameCluster) || (node->childR && !node->childR->sameCluster) ) return; // otherwise check if node->sameCluster flag needs updating size_t commonCluster = ds.find_set(node->left); if (node->childL) { size_t currentCluster = ds.find_set(node->childL->left); if (currentCluster != commonCluster) return; // not ready yet } if (node->childR) { size_t currentCluster = ds.find_set(node->childR->left); if (currentCluster != commonCluster) return; // not ready yet } node->sameCluster = true; } void HClustVpTreeSingle::print(HClustVpTreeSingleNode* node) { if (node->childL) { Rprintf("\"%llx\" -> \"%llx\" [label=\"L\"];\n", (unsigned long long)node, (unsigned long long)(node->childL)); print(node->childL); } if (node->childR) { Rprintf("\"%llx\" -> \"%llx\" [label=\"R\"];\n", (unsigned long long)node, (unsigned long long)(node->childR)); print(node->childR); } if (node->vpindex == SIZE_MAX) { for (size_t i=node->left; iright; ++i) Rprintf("\"%llx\" -> \"%llu\" [arrowhead = diamond];\n", (unsigned long long)node, (unsigned long long)indices[i]+1); } else { Rprintf("\"%llx\" [label=\"(%llu, %g)\"];\n", (unsigned long long)node, (unsigned long long)node->vpindex+1, node->radius); } } void HClustVpTreeSingle::print() { Rprintf("digraph vptree {\n"); Rprintf("size=\"6,6\";\n"); Rprintf("node [color=lightblue2, style=filled];"); print(root); Rprintf("}\n"); } genie/src/hclust2_nnbased_gini.h0000644000176200001440000000545713324042165016400 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #ifndef __HCLUST2_NNBASED_GINI_H #define __HCLUST2_NNBASED_GINI_H // ************************************************************************ #include #include #include #include #include "hclust2_common.h" #include "disjoint_sets.h" #include "hclust2_result.h" namespace grup { class HClustNNbasedGini { protected: HClustOptions* opts; size_t n; Distance* distance; std::vector indices; std::vector neighborsCount; std::vector minRadiuses; std::vector shouldFind; std::vector< std::deque > nearestNeighbors; HClustStats stats; bool symmetric; PhatDisjointSets ds; bool prefetch; virtual void getNearestNeighborsFromMinRadius(size_t index, size_t clusterIndex, double minR, double& maxR, NNHeap& nnheap) = 0; HeapNeighborItem getNearestNeighbor(size_t index, double distMax=INFINITY); void prefetchNNsSymmetric(); void computePrefetch(HclustPriorityQueue& pq); void computeMerge(HclustPriorityQueue& pq, HClustResult& res); void linkAndRecomputeGini(double& lastGini, size_t s1, size_t s2); public: HClustNNbasedGini(Distance* dist, HClustOptions* opts); virtual ~HClustNNbasedGini(); HClustResult compute(); inline const HClustStats& getStats() { return stats; } inline const HClustOptions& getOptions() { return *opts; } }; // class } // namespace grup #endif genie/src/disjoint_sets.h0000644000176200001440000001065613324042165015172 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #ifndef __DISJOINT_SETS_H #define __DISJOINT_SETS_H #include "defs.h" #include #include #include #include /* see defs.h */ #ifndef DISJOINT_SETS_DEBUG #define DISJOINT_SETS_DEBUG_CONST const #else #define DISJOINT_SETS_DEBUG_CONST /* const */ #endif namespace grup { class DisjointSets { private: std::vector< std::size_t > clusterParent; protected: std::size_t n; public: DisjointSets(std::size_t n); virtual ~DisjointSets(); virtual std::size_t link(std::size_t x, std::size_t y, std::size_t z); virtual std::size_t link(std::size_t x, std::size_t y); std::size_t union_set(std::size_t x, std::size_t y); inline std::size_t find_set(std::size_t x) { if (clusterParent[x] != x) return clusterParent[x] = find_set(clusterParent[x]); else return clusterParent[x]; } }; class PhatDisjointSets : public DisjointSets { private: std::vector< std::size_t > clusterSize; std::vector< std::size_t* > clusterMembers; std::vector< std::size_t > clusterNext; std::vector< std::size_t > clusterPrev; std::size_t clusterCount; std::size_t minClusterSize; std::size_t minClusterCount; void recomputeMinClusterSize(); public: PhatDisjointSets(std::size_t n); virtual ~PhatDisjointSets(); virtual std::size_t link(std::size_t x, std::size_t y); virtual std::size_t link(std::size_t x, std::size_t y, std::size_t z); inline std::size_t getClusterCount() const { return clusterCount; } inline std::size_t getMinClusterSize() const { return minClusterSize; } inline const std::size_t* getClusterMembers(std::size_t x) DISJOINT_SETS_DEBUG_CONST { #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(find_set(x) == x); STOPIFNOT(clusterMembers[x]); #endif return clusterMembers[x]; } inline std::size_t getClusterSize(std::size_t x) DISJOINT_SETS_DEBUG_CONST { #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(find_set(x) == x); STOPIFNOT(clusterSize[x] == 0 || clusterMembers[x] != NULL); #endif return clusterSize[x]; } inline std::size_t getClusterPrev(std::size_t x) DISJOINT_SETS_DEBUG_CONST { #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(find_set(x) == x); STOPIFNOT(find_set(clusterPrev[x]) == clusterPrev[x]); STOPIFNOT(find_set(clusterNext[x]) == clusterNext[x]); #endif return clusterPrev[x]; } inline std::size_t getClusterNext(std::size_t x) DISJOINT_SETS_DEBUG_CONST { /* to iterate over all clusters starting from x, use something like: for (size_t nx = ds.getClusterNext(x); nx != x; nx = ds.getClusterNext(nx)) { // e.g.: for (auto it = ds.getClusterMembers(nx).cbegin(); it != ds.getClusterMembers(nx).cend(); ++it) // play with *it } */ #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(find_set(x) == x); STOPIFNOT(find_set(clusterPrev[x]) == clusterPrev[x]); STOPIFNOT(find_set(clusterNext[x]) == clusterNext[x]); #endif return clusterNext[x]; } }; } /* namespace grup */ #endif genie/src/hclust2_nnbased_single.h0000644000176200001440000000554613324042165016732 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #ifndef __HCLUST2_NNBASED_SINGLE_H #define __HCLUST2_NNBASED_SINGLE_H // ************************************************************************ #include #include #include #include #include "hclust2_common.h" #include "disjoint_sets.h" #include "hclust2_result.h" using namespace std; using namespace Rcpp; namespace grup { class HClustNNbasedSingle { protected: HClustOptions* opts; size_t n; Distance* distance; std::vector indices; std::vector neighborsCount; std::vector minRadiuses; std::vector shouldFind; HClustStats stats; #ifdef _OPENMP omp_lock_t pqwritelock; #endif DisjointSets ds; bool prefetch; virtual void getNearestNeighborsFromMinRadius(size_t index, size_t clusterIndex, double minR, NNHeap& nnheap) = 0; void getNearestNeighbors(std::priority_queue & pq, size_t index); void computePrefetch(std::priority_queue & pq); void computeMerge(std::priority_queue & pq, HClustResult& res); public: HClustNNbasedSingle(Distance* dist, HClustOptions* opts); virtual ~HClustNNbasedSingle(); virtual void print() { Rcout << "this print method is a stub" << std::endl; } HClustResult compute(bool lite=false); inline const HClustStats& getStats() { return stats; } inline const HClustOptions& getOptions() { return *opts; } }; // class } // namespace grup #endif genie/src/hclust2_common.cpp0000644000176200001440000001743313324042165015600 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #include "hclust2_common.h" using namespace grup; HClustOptions* NNHeap::opts = NULL; HClustOptions::HClustOptions(Rcpp::RObject control) { maxLeavesElems = DEFAULT_MAX_LEAVES_ELEMS; maxNNPrefetch = DEFAULT_MAX_NN_PREFETCH; maxNNMerge = DEFAULT_MAX_NN_MERGE; minNNPrefetch = DEFAULT_MIN_NN_PREFETCH; minNNMerge = DEFAULT_MIN_NN_MERGE; vpSelectScheme = DEFAULT_VP_SELECT_SCHEME; vpSelectCand = DEFAULT_VP_SELECT_CAND; vpSelectTest = DEFAULT_VP_SELECT_TEST; nodesVisitedLimit = DEFAULT_NODES_VISITED_LIMIT; thresholdGini = DEFAULT_THRESHOLD_GINI; useVpTree = DEFAULT_USEVPTREE; useMST = DEFAULT_USEMST; if (!Rf_isNull((SEXP)control)) { Rcpp::List control2(control); // if (control2.containsElementNamed("exemplar")) { // Rcpp::CharacterVector ex = Rcpp::clone(Rcpp::as(control2["exemplar"])); // const char* exc = CHAR(STRING_ELT((SEXP)ex, 0)); // exemplar = std::string(exc); // //Rcpp::CharacterVector exemplar = Rcpp::CharacterVector(control2["exemplar"]); // } if (control2.containsElementNamed("maxLeavesElems")) { maxLeavesElems = (size_t)Rcpp::as(control2["maxLeavesElems"])[0]; } if (control2.containsElementNamed("maxNNPrefetch")) { maxNNPrefetch = (size_t)Rcpp::as(control2["maxNNPrefetch"])[0]; } if (control2.containsElementNamed("maxNNMerge")) { maxNNMerge = (size_t)Rcpp::as(control2["maxNNMerge"])[0]; } if (control2.containsElementNamed("minNNPrefetch")) { minNNPrefetch = (size_t)Rcpp::as(control2["minNNPrefetch"])[0]; } if (control2.containsElementNamed("minNNMerge")) { minNNMerge = (size_t)Rcpp::as(control2["minNNMerge"])[0]; } if (control2.containsElementNamed("vpSelectScheme")) { vpSelectScheme = (size_t)Rcpp::as(control2["vpSelectScheme"])[0]; } if (control2.containsElementNamed("vpSelectCand")) { vpSelectCand = (size_t)Rcpp::as(control2["vpSelectCand"])[0]; } if (control2.containsElementNamed("vpSelectTest")) { vpSelectTest = (size_t)Rcpp::as(control2["vpSelectTest"])[0]; } if (control2.containsElementNamed("nodesVisitedLimit")) { nodesVisitedLimit = (size_t)Rcpp::as(control2["nodesVisitedLimit"])[0]; } if (control2.containsElementNamed("thresholdGini")) { thresholdGini = (double)Rcpp::as(control2["thresholdGini"])[0]; } if (control2.containsElementNamed("useVpTree")) { useVpTree = (bool)Rcpp::as(control2["useVpTree"])[0]; } if (control2.containsElementNamed("useMST")) { useMST = (bool)Rcpp::as(control2["useMST"])[0]; } } if (thresholdGini < 0.0 || thresholdGini > 1.0) { thresholdGini = DEFAULT_THRESHOLD_GINI; Rf_warning("wrong thresholdGini value. using default"); } if (maxLeavesElems < 2 || maxLeavesElems > 512) { maxLeavesElems = DEFAULT_MAX_LEAVES_ELEMS; Rf_warning("wrong maxLeavesElems value. using default"); } if (maxNNPrefetch < 1) { maxNNPrefetch = DEFAULT_MAX_NN_PREFETCH; Rf_warning("wrong maxNNPrefetch value. using default"); } if (maxNNMerge < 1) { STOPIFNOT(maxNNPrefetch >= 1 && maxNNPrefetch != SIZE_MAX); maxNNMerge = DEFAULT_MAX_NN_MERGE; Rf_warning("wrong maxNNMerge value. using default"); } if (minNNPrefetch < 1) { minNNPrefetch = DEFAULT_MIN_NN_PREFETCH; Rf_warning("wrong minNNPrefetch value. using default"); } if (minNNMerge < 1) { STOPIFNOT(minNNPrefetch >= 1 && minNNPrefetch != SIZE_MAX); minNNMerge = DEFAULT_MIN_NN_MERGE; Rf_warning("wrong minNNMerge value. using default"); } if (vpSelectScheme < 1 || vpSelectScheme > 3) { vpSelectScheme = DEFAULT_VP_SELECT_SCHEME; Rf_warning("wrong vpSelectScheme value. using default"); } if (vpSelectCand < 1 || vpSelectCand > 128) { vpSelectCand = DEFAULT_VP_SELECT_CAND; Rf_warning("wrong vpSelectCand value. using default"); } if (vpSelectTest < 1 || vpSelectTest > 128) { vpSelectTest = DEFAULT_VP_SELECT_TEST; Rf_warning("wrong vpSelectTest value. using default"); } if (nodesVisitedLimit < 1) { nodesVisitedLimit = DEFAULT_NODES_VISITED_LIMIT; Rf_warning("wrong nodesVisitedLimit value. using default"); } } Rcpp::NumericVector HClustOptions::toR() const { return Rcpp::NumericVector::create( Rcpp::_["maxLeavesElems"] = maxLeavesElems, Rcpp::_["maxNNPrefetch"] = maxNNPrefetch, Rcpp::_["maxNNMerge"] = maxNNMerge, Rcpp::_["minNNPrefetch"] = minNNPrefetch, Rcpp::_["minNNMerge"] = minNNMerge, Rcpp::_["vpSelectScheme"] = vpSelectScheme, Rcpp::_["vpSelectCand"] = vpSelectCand, Rcpp::_["vpSelectTest"] = vpSelectTest, Rcpp::_["nodesVisitedLimit"] = nodesVisitedLimit, Rcpp::_["thresholdGini"] = thresholdGini, Rcpp::_["useVpTree"] = useVpTree, Rcpp::_["useMST"] = useMST ); } HClustStats::HClustStats() : nodeCount(0), leafCount(0), nodeVisit(0), nnCals(0), nnCount(0), medoidOldNew(0), medoidUpdateCount(0) {} HClustStats::~HClustStats() { #if VERBOSE > 0 Rprintf(" vp-tree: nodeCount=%.0f, leafCount=%.0f, nodeVisit=%.0f, nnCals=%.0f, nnCount=%.0f, medoidUpdateCount=%.0f, medoidOldNew=%.0f\n", (double)nodeCount, (double)leafCount, (double)nodeVisit, (double)nnCals, (double)nnCount,(double)medoidUpdateCount, (double)medoidOldNew); #endif } Rcpp::NumericVector HClustStats::toR() const { return Rcpp::NumericVector::create( Rcpp::_["nodeCount"] = (nodeCount>0)?(double)nodeCount:NA_REAL, Rcpp::_["leafCount"] = (leafCount>0)?(double)leafCount:NA_REAL, Rcpp::_["nodeVisit"] = (nodeVisit>0)?(double)nodeVisit:NA_REAL, Rcpp::_["nnCals"] = (nnCals>0)?(double)nnCals:NA_REAL, Rcpp::_["nnCount"] = (nnCount>0)?(double)nnCount:NA_REAL, Rcpp::_["medoidUpdateCount"] = (medoidUpdateCount>0)?medoidUpdateCount:NA_REAL, Rcpp::_["medoidOldNew"] = (medoidOldNew>0)?medoidOldNew:NA_REAL ); } genie/src/init.cpp0000644000176200001440000000122513324042165013577 0ustar liggesusers// #include // #include // #include // #include // #include // // // extern "C" SEXP _genie_hclust2_gini(SEXP distanceSEXP, SEXP objectsSEXP, SEXP controlSEXP); // // static const R_CallMethodDef cCallMethods[] = { // {"genie_hclust2_gini", (DL_FUNC) &_genie_hclust2_gini, 3}, // {NULL, NULL, 0} // }; // // extern "C" void R_init_genie(DllInfo *dll) // { // R_registerRoutines(dll, NULL, cCallMethods, NULL, NULL); // R_useDynamicSymbols(dll, (Rboolean)FALSE); // #if defined(R_VERSION) && R_VERSION >= R_Version(3, 0, 0) // R_forceSymbols(dll, (Rboolean)TRUE); // #endif // } // genie/src/hclust2_result.h0000644000176200001440000000460713324042165015272 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #ifndef __HCLUST2_RESULT_H #define __HCLUST2_RESULT_H #include "defs.h" #include "disjoint_sets.h" #include "hclust2_common.h" #include "hclust2_distance.h" #include namespace grup { class HClustResult { private: size_t curiter; size_t n; Rcpp::NumericMatrix links; Rcpp::NumericMatrix merge; Rcpp::NumericVector height; Rcpp::NumericVector order; // call is set by R // method is set by R Rcpp::RObject labels; Rcpp::RObject dist_method; bool lite; void generateMergeMatrix(); void generateOrderVector(); public: HClustResult(size_t n, Distance* dist, bool lite=false); Rcpp::NumericMatrix getLinks() { return links; } Rcpp::NumericVector getHeight() { return height; } void link(size_t i1, size_t i2, double d12); Rcpp::List toR( const grup::HClustStats& hclustStats, const grup::HClustOptions& hclustOptions, const grup::DistanceStats& distStats ); }; // struct HClustResult } // namespace grup #endif genie/src/hclust2_rcpp_gini.cpp0000644000176200001440000000476713324042165016270 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #include "hclust2_vptree_gini.h" #include "hclust2_mstbased_gini.h" using namespace Rcpp; // [[Rcpp::export(".hclust2_gini")]] RObject hclust2_gini(RObject distance, RObject objects, RObject control=R_NilValue) { MESSAGE_2("[%010.3f] starting timer\n", clock()/(double)CLOCKS_PER_SEC); Rcpp::RObject result(R_NilValue); grup::Distance* dist = grup::Distance::createDistance(distance, objects, control); try { /* Rcpp::checkUserInterrupt(); may throw an exception */ grup::HClustOptions opts(control); grup::NNHeap::setOptions(&opts); grup::HClustMSTbasedGini hclust(dist, &opts); grup::HClustResult result2 = grup::HClustMSTbasedGini(dist, &opts).compute(); result = Rcpp::as( result2.toR(hclust.getStats(), hclust.getOptions(), dist->getStats()) ); } catch(...) { // do nothing yet } #if VERBOSE > 0 dist->getStats().print(); #endif if (dist) delete dist; MESSAGE_2("[%010.3f] done\n", clock()/(double)CLOCKS_PER_SEC); if (Rf_isNull(result)) Rcpp::stop("stopping on error or explicit user interrupt"); return result; } genie/src/hclust2_nnbased_single.cpp0000644000176200001440000001500613324042165017255 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #include "hclust2_nnbased_single.h" using namespace grup; // constructor (OK, we all know what this is, but I label it for faster in-code search) HClustNNbasedSingle::HClustNNbasedSingle(Distance* dist, HClustOptions* opts) : opts(opts), n(dist->getObjectCount()), distance(dist), indices(dist->getObjectCount()), neighborsCount(dist->getObjectCount(), 0), minRadiuses(dist->getObjectCount(), -INFINITY), shouldFind(dist->getObjectCount(), true), #ifdef GENERATE_STATS stats(), #endif ds(dist->getObjectCount()) { // starting indices: random permutation of {0,1,...,_n-1} for (size_t i=0;i= 1; i--) swap(indices[i], indices[(size_t)(unif_rand()*(i+1))]); #ifdef _OPENMP omp_init_lock(&pqwritelock); #endif } HClustNNbasedSingle::~HClustNNbasedSingle() { #ifdef _OPENMP omp_destroy_lock(&pqwritelock); #endif } void HClustNNbasedSingle::getNearestNeighbors( std::priority_queue< HeapHierarchicalItem > & pq, size_t index) { if (!shouldFind[index]) return; size_t clusterIndex = ds.find_set(index); #ifdef GENERATE_STATS #ifdef _OPENMP #pragma omp atomic #endif ++stats.nnCals; #endif NNHeap nnheap; getNearestNeighborsFromMinRadius(index, clusterIndex, minRadiuses[index], nnheap); size_t newNeighborsCount = 0.0; #ifdef _OPENMP omp_set_lock(&pqwritelock); #endif while (!nnheap.empty()) { if (isfinite(nnheap.top().dist) && nnheap.top().index != SIZE_MAX) { ++newNeighborsCount; pq.push(HeapHierarchicalItem(index, nnheap.top().index, nnheap.top().dist)); minRadiuses[index] = std::max(minRadiuses[index], nnheap.top().dist); } nnheap.pop(); } neighborsCount[index] += newNeighborsCount; #ifdef GENERATE_STATS stats.nnCount += newNeighborsCount; #endif if (neighborsCount[index] > n - index || newNeighborsCount == 0) shouldFind[index] = false; else { pq.push(HeapHierarchicalItem(index, SIZE_MAX, minRadiuses[index])); // to be continued... } #ifdef _OPENMP omp_unset_lock(&pqwritelock); #endif } void HClustNNbasedSingle::computePrefetch(std::priority_queue< HeapHierarchicalItem > & pq) { // INIT: Pre-fetch a few nearest neighbors for each point MESSAGE_2("[%010.3f] prefetching NNs\n", clock()/(float)CLOCKS_PER_SEC); #ifdef _OPENMP omp_set_dynamic(0); /* the runtime will not dynamically adjust the number of threads */ #pragma omp parallel for schedule(dynamic) #endif for (size_t i=0; i & pq, HClustResult& res) { MESSAGE_2("[%010.3f] merging clusters\n", clock()/(float)CLOCKS_PER_SEC); volatile bool go=true; volatile size_t i = 0; #ifdef _OPENMP #pragma omp parallel #endif while (go) { #ifdef _OPENMP omp_set_lock(&pqwritelock); #endif STOPIFNOT(!pq.empty()) HeapHierarchicalItem hhi = pq.top(); if (hhi.index2 == SIZE_MAX) { pq.pop(); #ifdef _OPENMP omp_unset_lock(&pqwritelock); #endif getNearestNeighbors(pq, hhi.index1); continue; } size_t s1 = ds.find_set(hhi.index1); size_t s2 = ds.find_set(hhi.index2); if (s1 == s2) { pq.pop(); #ifdef _OPENMP omp_unset_lock(&pqwritelock); #endif continue; } #ifdef _OPENMP omp_unset_lock(&pqwritelock); //different threads will be unable to put data into pq without it #pragma omp barrier #pragma omp single #endif { hhi = pq.top(); //it can change, because other threads can push something pq.pop(); s1 = ds.find_set(hhi.index1); s2 = ds.find_set(hhi.index2); STOPIFNOT(s1 != s2); STOPIFNOT(s2 != SIZE_MAX); STOPIFNOT(hhi.index1 < hhi.index2); res.link(indices[hhi.index1], indices[hhi.index2], hhi.dist); ds.link(s1, s2); ++i; if (i == n-1) go = false;/* avoids computing unnecessary nn */ } // #pragma omp single if (MASTER_OR_SINGLE_THREAD) { if (i % 512 == 0) MESSAGE_7("\r merge clusters: %d / %d", i+1, n-1); Rcpp::checkUserInterrupt(); // may throw an exception, fast op, not thread safe } } MESSAGE_7("\r merge clusters: %d / %d \n", n-1, n-1); Rcpp::checkUserInterrupt(); } HClustResult HClustNNbasedSingle::compute(bool lite) { std::priority_queue< HeapHierarchicalItem > pq; // HclustPriorityQueue pq(n); HClustResult res(n, distance, lite); #if VERBOSE >= 5 distance->getStats().print(); #endif prefetch = true; computePrefetch(pq); prefetch = false; #if VERBOSE >= 5 distance->getStats().print(); #endif computeMerge(pq, res); return res; } genie/src/Makevars0000644000176200001440000000013413241042746013625 0ustar liggesusersCXX_STD = CXX11 PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) genie/src/hclust2_distance.cpp0000644000176200001440000005056113324042165016101 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #include #include "hclust2_distance.h" using namespace grup; // ------------------------------------------------------------------------ #ifdef MEASURE_MEM_USE /* * Author: David Robert Nadeau * Site: http://NadeauSoftware.com/ * License: Creative Commons Attribution 3.0 Unported License * http://creativecommons.org/licenses/by/3.0/deed.en_US * * * http://stackoverflow.com/questions/669438/how-to-get-memory-usage-at-run-time-in-c */ #if defined(_WIN32) #include #include #elif defined(__unix__) || defined(__unix) || defined(unix) || (defined(__APPLE__) && defined(__MACH__)) #include #include #if defined(__APPLE__) && defined(__MACH__) #include #elif (defined(_AIX) || defined(__TOS__AIX__)) || (defined(__sun__) || defined(__sun) || defined(sun) && (defined(__SVR4) || defined(__svr4__))) #include #include #elif defined(__linux__) || defined(__linux) || defined(linux) || defined(__gnu_linux__) #include #endif #else #error "Cannot define getPeakRSS( ) or getCurrentRSS( ) for an unknown OS." #endif /** * Returns the peak (maximum so far) resident set size (physical * memory use) measured in bytes, or zero if the value cannot be * determined on this OS. */ size_t getPeakRSS( ) { #if defined(_WIN32) /* Windows -------------------------------------------------- */ PROCESS_MEMORY_COUNTERS info; GetProcessMemoryInfo( GetCurrentProcess( ), &info, sizeof(info) ); return (size_t)info.PeakWorkingSetSize; #elif (defined(_AIX) || defined(__TOS__AIX__)) || (defined(__sun__) || defined(__sun) || defined(sun) && (defined(__SVR4) || defined(__svr4__))) /* AIX and Solaris ------------------------------------------ */ struct psinfo psinfo; int fd = -1; if ( (fd = open( "/proc/self/psinfo", O_RDONLY )) == -1 ) return (size_t)0L; /* Can't open? */ if ( read( fd, &psinfo, sizeof(psinfo) ) != sizeof(psinfo) ) { close( fd ); return (size_t)0L; /* Can't read? */ } close( fd ); return (size_t)(psinfo.pr_rssize * 1024L); #elif defined(__unix__) || defined(__unix) || defined(unix) || (defined(__APPLE__) && defined(__MACH__)) /* BSD, Linux, and OSX -------------------------------------- */ struct rusage rusage; getrusage( RUSAGE_SELF, &rusage ); #if defined(__APPLE__) && defined(__MACH__) return (size_t)rusage.ru_maxrss; #else return (size_t)(rusage.ru_maxrss * 1024L); #endif #else /* Unknown OS ----------------------------------------------- */ return (size_t)0L; /* Unsupported. */ #endif } /** * Returns the current resident set size (physical memory use) measured * in bytes, or zero if the value cannot be determined on this OS. */ size_t getCurrentRSS( ) { #if defined(_WIN32) /* Windows -------------------------------------------------- */ PROCESS_MEMORY_COUNTERS info; GetProcessMemoryInfo( GetCurrentProcess( ), &info, sizeof(info) ); return (size_t)info.WorkingSetSize; #elif defined(__APPLE__) && defined(__MACH__) /* OSX ------------------------------------------------------ */ struct mach_task_basic_info info; mach_msg_type_number_t infoCount = MACH_TASK_BASIC_INFO_COUNT; if ( task_info( mach_task_self( ), MACH_TASK_BASIC_INFO, (task_info_t)&info, &infoCount ) != KERN_SUCCESS ) return (size_t)0L; /* Can't access? */ return (size_t)info.resident_size; #elif defined(__linux__) || defined(__linux) || defined(linux) || defined(__gnu_linux__) /* Linux ---------------------------------------------------- */ long rss = 0L; FILE* fp = NULL; if ( (fp = fopen( "/proc/self/statm", "r" )) == NULL ) return (size_t)0L; /* Can't open? */ if ( fscanf( fp, "%*s%ld", &rss ) != 1 ) { fclose( fp ); return (size_t)0L; /* Can't read? */ } fclose( fp ); return (size_t)rss * (size_t)sysconf( _SC_PAGESIZE); #else /* AIX, BSD, Solaris, and Unknown OS ------------------------ */ return (size_t)0L; /* Unsupported. */ #endif } #endif // ------------------------------------------------------------------------ void DistanceStats::print() const { #if VERBOSE > 0 // #if defined(HASHMAP_ENABLED) && defined(GENERATE_STATS) // Rprintf(" distance function hashmap #hits: %.0f, #miss: %.0f, est.mem.used: ~%.1fMB (vs %.1fMB)\n", // (double)hashmapHit, (double)hashmapMiss, // 8.0f*hashmapMiss/1000.0f/1000.0f, // 8.0f*distCallTheoretical/1000.0f/1000.0f); // #endif #if defined(GENERATE_STATS) Rprintf(" distance function total calls: %.0f (i.e., %.2f%% of %.0f)\n", (double)distCallCount, (double)distCallCount*100.0/(double)distCallTheoretical, (double)distCallTheoretical ); #if defined(MEASURE_MEM_USE) Rprintf(" currentRSS=%.0f MB, peakRSS=%.0f MB\n", (double)getCurrentRSS()/1000.0/1000.0, (double)getPeakRSS()/1000.0/1000.0 ); #endif #endif #endif } Distance::Distance(size_t n) : // #ifdef HASHMAP_ENABLED // hashmap(std::vector< std::unordered_map >(n)), // #endif stats(DistanceStats(n)), n(n) { // #ifdef HASHMAP_ENABLED // MESSAGE_1("Warning: HASHMAP_ENABLED is defined in hclust2_distance.h\n"); // #endif #ifdef GENERATE_STATS MESSAGE_1("Warning: GENERATE_STATS is defined in hclust2_distance.h\n"); #endif } Distance::~Distance() { // #if VERBOSE > 5 // Rprintf("[%010.3f] destroying distance object (base)\n", clock()/(float)CLOCKS_PER_SEC); // #endif } // #ifdef HASHMAP_ENABLED // double Distance::operator()(size_t v1, size_t v2) // { // if (v1 == v2) return 0.0; // if (v1 > v2) std::swap(v1, v2); // // #ifdef GENERATE_STATS // #ifdef _OPENMP // #pragma omp atomic // #endif // ++stats.distCallCount; // #endif // // // this is thread unsafe, but we use it only for testing: // auto got = hashmap[v1].find(v2); // if ( got == hashmap[v1].end() ) // { // #ifdef GENERATE_STATS // #ifdef _OPENMP // #pragma omp atomic // #endif // ++stats.hashmapMiss; // #endif // double d = compute(v1, v2); // hashmap[v1].emplace(v2, d); // return d; // } // else // { // #ifdef GENERATE_STATS // #ifdef _OPENMP // #pragma omp atomic // #endif // ++stats.hashmapHit; // #endif // return got->second; // } // } // #endif Distance* Distance::createDistance(Rcpp::RObject distance, Rcpp::RObject objects, Rcpp::RObject control) { if (Rf_isVectorList(objects) && Rf_isFunction(distance)) { Rcpp::Function distance2(distance); Rcpp::List objects2(objects); return (grup::Distance*) new grup::GenericRDistance( distance2, objects2 ); } else if (Rf_isNumeric(distance) && Rf_isObject(distance) && !strcmp(distance.attr("class"), "dist") && Rf_isNull(objects)) { return (grup::Distance*) new grup::DistObjectDistance( (Rcpp::NumericVector)distance ); } else if (Rf_isVectorList(objects) && (Rf_isNull(distance) || Rf_isString(distance))) { Rcpp::List objects2(objects); Rcpp::CharacterVector distance2 = ((Rf_isNull(distance))?Rcpp::CharacterVector("levenshtein"):Rcpp::CharacterVector(distance)); const char* distance3 = CHAR(STRING_ELT((SEXP)distance2, 0)); if (!strcmp(distance3, "levenshtein")) { return (grup::Distance*)new grup::LevenshteinDistanceInt(objects2); } else if (!strcmp(distance3, "dinu")) { return (grup::Distance*)new grup::DinuDistanceInt(objects2); } else if (!strcmp(distance3, "hamming")) { return (grup::Distance*)new grup::HammingDistanceInt(objects2); } else if (!strcmp(distance3, "euclinf")) { Rcpp::List control2(control); double p, r; if (control2.containsElementNamed("p")) p = (size_t)Rcpp::as(control2["p"])[0]; else Rcpp::stop("In euclinf p should be given."); if (control2.containsElementNamed("r")) r = (size_t)Rcpp::as(control2["r"])[0]; else Rcpp::stop("In euclinf r should be given."); return (grup::Distance*)new grup::Euclinf(objects2, p, r); } else { Rcpp::stop("`distance` should be one of: \"levenshtein\" (default), \"dinu\", \"hamming\", \"euclinf\""); } } else if (Rf_isString(objects) && (Rf_isNull(distance) || Rf_isString(distance))) { Rcpp::CharacterVector objects2(objects); Rcpp::CharacterVector distance2 = ((Rf_isNull(distance))?Rcpp::CharacterVector("levenshtein"):Rcpp::CharacterVector(distance)); const char* distance3 = CHAR(STRING_ELT((SEXP)distance2, 0)); if (!strcmp(distance3, "levenshtein")) { return (grup::Distance*)new grup::LevenshteinDistanceChar(objects2); } else if (!strcmp(distance3, "dinu")) { return (grup::Distance*)new grup::DinuDistanceChar(objects2); } else if (!strcmp(distance3, "hamming")) { return (grup::Distance*)new grup::HammingDistanceChar(objects2); } else { Rcpp::stop("`distance` should be one of: \"levenshtein\" (default), \"dinu\", \"hamming\""); } } else if (Rf_isMatrix(objects) && Rf_isNumeric(objects) && (Rf_isNull(distance) || Rf_isString(distance))) { Rcpp::NumericMatrix objects2(objects); Rcpp::CharacterVector distance2 = ((Rf_isNull(distance))?Rcpp::CharacterVector("euclidean_squared"):Rcpp::CharacterVector(distance)); const char* distance3 = CHAR(STRING_ELT((SEXP)distance2, 0)); if (!strcmp(distance3, "euclidean_squared")) { return (grup::Distance*) new grup::SquaredEuclideanDistance( objects2 ); } else if (!strcmp(distance3, "euclidean")) { return (grup::Distance*) new grup::EuclideanDistance( objects2 ); } else if (!strcmp(distance3, "manhattan")) { return (grup::Distance*) new grup::ManhattanDistance( objects2 ); } else if (!strcmp(distance3, "maximum")) { return (grup::Distance*) new grup::MaximumDistance( objects2 ); } else if (!strcmp(distance3, "hamming")) { return (grup::Distance*) new grup::HammingDistance( objects2 ); } else { Rcpp::stop("`distance` should be one of: \"euclidean_squared\" (default), \"euclidean\", \"manhattan\", \"maximum\", \"hamming\""); } } else { Rcpp::stop("incorrect input data"); } return NULL; } GenericMatrixDistance::GenericMatrixDistance(const Rcpp::NumericMatrix& points) : Distance(points.nrow()), items(REAL((SEXP)points)), m(points.ncol()) { // act on a transposed matrix to avoid many L1/L... cache misses items = new double[m*n]; const double* items2 = REAL((SEXP)points); double* items_ptr = items; for (size_t i=0; i d) d = d2; } return d; } double HammingDistance::compute(size_t v1, size_t v2) { if (v1 == v2) return 0.0; double d = 0.0; for (size_t i=0; i double distance_levenshtein(const T* s1, const T* s2, size_t n1, size_t n2) { if (n1 < n2) { std::swap(s1, s2); // pointer swap std::swap(n1, n2); } // #ifdef _OPENMP // to be thread-safe, we have to allocate these 2 arrays each time... size_t* v_cur = new size_t[n2+1]; size_t* v_last = new size_t[n2+1]; // #endif // n2 <= n1 for (size_t j=0; j<=n2; ++j) v_cur[j] = j; for (size_t i=1; i<=n1; ++i) { std::swap(v_last, v_cur); // pointer swap v_cur[0] = i; for (size_t j=1; j<=n2; ++j) { if (s1[i-1] == s2[j-1]) v_cur[j] = v_last[j-1]; else v_cur[j] = std::min(std::min( v_last[j-1]+1, v_cur[j-1]+1), v_last[j]+1); } } double ret = (double) v_cur[n2]; // #ifdef _OPENMP delete [] v_cur; delete [] v_last; // #endif return ret; } double LevenshteinDistanceInt::compute(size_t v1, size_t v2) { return distance_levenshtein(items[v1], items[v2], lengths[v1], lengths[v2]); } double LevenshteinDistanceChar::compute(size_t v1, size_t v2) { return distance_levenshtein(items[v1], items[v2], lengths[v1], lengths[v2]); } // -------------------------------------------------------------------------------------------- template double distance_hamming(const T* s1, const T* s2, size_t n1, size_t n2) { if (n1 != n2) Rcpp::stop("objects should be of the same dimension"); double d = 0.0; for (size_t i=0; i double distance_dinu(const T* x, const T* y, const size_t* ox, const size_t* oy, size_t nx, size_t ny) { double d = 0.0; size_t ix = 0, iy = 0; while (ix < nx && iy < ny) { if (x[ox[ix]] == y[oy[iy]]) d += std::abs((ox[ix++]+1.0) - (oy[iy++]+1.0)); else if (x[ox[ix]] < y[oy[iy]]) d += std::abs((ox[ix++]+1.0) - 0.0); else d += std::abs(0.0 - (oy[iy++]+1.0)); } while (ix < nx) d += std::abs((ox[ix++]+1.0) - 0.0); while (iy < ny) d += std::abs(0.0 - (oy[iy++]+1.0)); return d; } double DinuDistanceInt::compute(size_t v1, size_t v2) { const int* x = items[v1]; const int* y = items[v2]; const size_t* ox = ranks[v1].data(); const size_t* oy = ranks[v2].data(); size_t nx = lengths[v1]; size_t ny = lengths[v2]; return distance_dinu(x, y, ox, oy, nx, ny); } double DinuDistanceChar::compute(size_t v1, size_t v2) { const char* x = items[v1]; const char* y = items[v2]; const size_t* ox = ranks[v1].data(); const size_t* oy = ranks[v2].data(); size_t nx = lengths[v1]; size_t ny = lengths[v2]; return distance_dinu(x, y, ox, oy, nx, ny); } double Euclinf::compute(size_t v1, size_t v2) { const double* x = items[v1]; const double* y = items[v2]; size_t nx = lengths[v1]; size_t ny = lengths[v2]; double dist = 0.0; std::size_t min_nx_ny = std::min(nx, ny); for (std::size_t i=0; i. * * ************************************************************************* */ #ifndef __HCLUST2_DISTANCE_H #define __HCLUST2_DISTANCE_H #include "defs.h" /* add string dists = lcs, dam-lev numeric -> metric: binary (see dist) minkowski (p), canberra allow external ptr distance: double dist(const char* s1, int nx, const char* s2, int ny) double dist(const double* s1, int nx, const double* s2, int ny) double dist(const int* s1, int nx, const int* s2, int ny) double dist(SEXP s1, SEXP s2) use cases: objects 1:n, distance(i,j) -> ith, jth row of a data frame (check namespaces... - call within an R function) test for proper NA handling in Matrix and String distance GenericRDistance, DistObjectDistance -- how to handle NAs?? */ #include #include #include #include #include namespace grup { struct DistanceStats { // size_t hashmapHit; // size_t hashmapMiss; size_t distCallCount; size_t distCallTheoretical; DistanceStats(size_t n) : // hashmapHit(0), hashmapMiss(0), distCallCount(0), distCallTheoretical(n*(n-1)/2) {} void print() const; Rcpp::NumericVector toR() const { return Rcpp::NumericVector::create( // Rcpp::_["hashmapHit"] // = (hashmapHit>0)?(double)hashmapHit:NA_REAL, // Rcpp::_["hashmapMiss"] // = (hashmapMiss>0)?(double)hashmapMiss:NA_REAL, Rcpp::_["distCallCount"] = (distCallCount>0)?(double)distCallCount:NA_REAL, Rcpp::_["distCallTheoretical"] = (distCallTheoretical>0)?(double)distCallTheoretical:NA_REAL ); } }; class Distance { private: #ifdef HASHMAP_ENABLED std::vector< std::unordered_map > hashmap; #endif DistanceStats stats; protected: size_t n; virtual double compute(size_t v1, size_t v2) = 0; public: Distance(size_t n); virtual ~Distance(); inline size_t getObjectCount() { return n; } static Distance* createDistance(Rcpp::RObject distance, Rcpp::RObject objects, Rcpp::RObject control=R_NilValue); virtual Rcpp::RObject getLabels() { /* stub */ return R_NilValue; } virtual Rcpp::RObject getDistMethod() { /* stub */ return R_NilValue; } inline const DistanceStats& getStats() { return stats; } #ifdef HASHMAP_ENABLED double operator()(size_t v1, size_t v2); #else inline double operator()(size_t v1, size_t v2) { #ifdef GENERATE_STATS #ifdef _OPENMP #pragma omp atomic #endif ++stats.distCallCount; #endif return compute(v1, v2); } #endif }; class GenericMatrixDistance : public Distance { protected: double* items; size_t m; public: // TO DO: virtual Rcpp::RObject getLabels() { /* stub */ return R_NilValue; } --- get row names GenericMatrixDistance(const Rcpp::NumericMatrix& points); virtual ~GenericMatrixDistance() { // #if VERBOSE > 5 // Rprintf("[%010.3f] destroying distance object\n", clock()/(float)CLOCKS_PER_SEC); // #endif delete [] items; } }; class SquaredEuclideanDistance : public GenericMatrixDistance { protected: virtual double compute(size_t v1, size_t v2); public: virtual Rcpp::RObject getDistMethod() { return Rf_mkString("euclidean_squared"); } SquaredEuclideanDistance(const Rcpp::NumericMatrix& points) : GenericMatrixDistance(points) { } }; class EuclideanDistance : public GenericMatrixDistance { // private: // std::vector sqobs; protected: virtual double compute(size_t v1, size_t v2); public: virtual Rcpp::RObject getDistMethod() { return Rf_mkString("euclidean"); } EuclideanDistance(const Rcpp::NumericMatrix& points) : GenericMatrixDistance(points) { // const double* items_ptr = items; // for (size_t i=0; i > ranks; public: virtual Rcpp::RObject getDistMethod() { return Rf_mkString("dinu"); } DinuDistanceInt(const Rcpp::List& strings) : StringDistanceInt(strings), ranks(n) { // TODO: openmp for (size_t i=0; i > ranks; public: virtual Rcpp::RObject getDistMethod() { return Rf_mkString("dinu"); } DinuDistanceChar(const Rcpp::CharacterVector& strings) : StringDistanceChar(strings), ranks(n) { // TODO: openmp for (size_t i=0; i. * * ************************************************************************* */ #ifndef __HCLUST2_MSTBASED_GINI_H #define __HCLUST2_MSTBASED_GINI_H // ************************************************************************ #include #include #include #include #include "hclust2_common.h" #include "disjoint_sets.h" #include "hclust2_result.h" namespace grup { class HClustMSTbasedGini { protected: HClustOptions* opts; size_t n; HClustStats stats; Distance* distance; HclustPriorityQueue getMST(); void linkAndRecomputeGini(PhatDisjointSets& ds, double& lastGini, size_t s1, size_t s2); public: HClustMSTbasedGini(Distance* dist, HClustOptions* opts); virtual ~HClustMSTbasedGini(); HClustResult compute(); inline const HClustStats& getStats() { return stats; } inline const HClustOptions& getOptions() { return *opts; } }; // class } // namespace grup #endif genie/src/hclust2_vptree_single.h0000644000176200001440000001135613324042165016621 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #ifndef __HCLUST2_VPTREE_SINGLE_H #define __HCLUST2_VPTREE_SINGLE_H // ************************************************************************ #include "hclust2_nnbased_single.h" namespace grup { struct HClustVpTreeSingleNode { size_t vpindex; size_t left; size_t right; double radius; bool sameCluster; size_t maxindex; HClustVpTreeSingleNode* childL; HClustVpTreeSingleNode* childR; HClustVpTreeSingleNode() : vpindex(SIZE_MAX), left(SIZE_MAX), right(SIZE_MAX), radius(-INFINITY), sameCluster(false), maxindex(SIZE_MAX), childL(NULL), childR(NULL) { } HClustVpTreeSingleNode(size_t left, size_t right) : vpindex(SIZE_MAX), left(left), right(right), radius(-INFINITY), sameCluster(false), maxindex(SIZE_MAX), childL(NULL), childR(NULL) { } HClustVpTreeSingleNode(size_t vpindex, size_t left, size_t right, double radius) : vpindex(vpindex), left(left), right(right), radius(radius), sameCluster(false), maxindex(SIZE_MAX), childL(NULL), childR(NULL) { } ~HClustVpTreeSingleNode() { if (childL) delete childL; if (childR) delete childR; } }; class HClustVpTreeSingle : public HClustNNbasedSingle { protected: HClustVpTreeSingleNode* root; // bool visitAll; // for testing only size_t chooseNewVantagePoint(size_t left, size_t right); HClustVpTreeSingleNode* buildFromPoints(size_t left, size_t right, std::vector& distances); inline void getNearestNeighborsFromMinRadiusRecursive(HClustVpTreeSingleNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap) { // search within (minR, maxR] STOPIFNOT(node != NULL); #ifdef GENERATE_STATS #ifdef _OPENMP #pragma omp atomic #endif ++stats.nodeVisit; #endif if (!prefetch && node->sameCluster && clusterIndex == ds.find_set(node->left)) return; if (node->vpindex == SIZE_MAX) { // leaf getNearestNeighborsFromMinRadiusRecursiveLeaf(node, index, clusterIndex, minR, bestR, maxR, nnheap); } else { getNearestNeighborsFromMinRadiusRecursiveNonLeaf(node, index, clusterIndex, minR, bestR, maxR, nnheap); } } void getNearestNeighborsFromMinRadiusRecursiveLeaf(HClustVpTreeSingleNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap); void getNearestNeighborsFromMinRadiusRecursiveNonLeaf(HClustVpTreeSingleNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap); virtual void getNearestNeighborsFromMinRadius(size_t index, size_t clusterIndex, double minR, NNHeap& nnheap) { std::priority_queue bestR; size_t minNN = (prefetch)?opts->minNNPrefetch:opts->minNNMerge; for (size_t i=0; i. * * ************************************************************************* */ #ifndef __HCLUST2_VPTREE_GINI_H #define __HCLUST2_VPTREE_GINI_H // ************************************************************************ #include "hclust2_nnbased_gini.h" namespace grup { struct HClustVpTreeGiniNode { size_t vpindex; size_t left; size_t right; double radius; bool sameCluster; size_t maxindex; HClustVpTreeGiniNode* childL; HClustVpTreeGiniNode* childR; HClustVpTreeGiniNode() : vpindex(SIZE_MAX), left(SIZE_MAX), right(SIZE_MAX), radius(-INFINITY), sameCluster(false), maxindex(SIZE_MAX), childL(NULL), childR(NULL) { } HClustVpTreeGiniNode(size_t left, size_t right) : vpindex(SIZE_MAX), left(left), right(right), radius(-INFINITY), sameCluster(false), maxindex(SIZE_MAX), childL(NULL), childR(NULL) { } HClustVpTreeGiniNode(size_t vpindex, size_t left, size_t right, double radius) : vpindex(vpindex), left(left), right(right), radius(radius), sameCluster(false), maxindex(SIZE_MAX), childL(NULL), childR(NULL) { } ~HClustVpTreeGiniNode() { if (childL) delete childL; if (childR) delete childR; } }; class HClustVpTreeGini : public HClustNNbasedGini { protected: HClustVpTreeGiniNode* root; // bool visitAll; // for testing only size_t chooseNewVantagePoint(size_t left, size_t right); HClustVpTreeGiniNode* buildFromPoints(size_t left, size_t right, std::vector& distances); inline void getNearestNeighborsFromMinRadiusRecursive(HClustVpTreeGiniNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap) { // search within (minR, maxR] STOPIFNOT(node != NULL); #ifdef GENERATE_STATS #ifdef _OPENMP #pragma omp atomic #endif ++stats.nodeVisit; #endif if (!prefetch && node->sameCluster && clusterIndex == ds.find_set(node->left)) return; if (node->vpindex == SIZE_MAX) { // leaf getNearestNeighborsFromMinRadiusRecursiveLeaf(node, index, clusterIndex, minR, bestR, maxR, nnheap); } else { getNearestNeighborsFromMinRadiusRecursiveNonLeaf(node, index, clusterIndex, minR, bestR, maxR, nnheap); } } void getNearestNeighborsFromMinRadiusRecursiveLeaf(HClustVpTreeGiniNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap); void getNearestNeighborsFromMinRadiusRecursiveNonLeaf(HClustVpTreeGiniNode* node, size_t index, size_t clusterIndex, double minR, std::priority_queue& bestR, double& maxR, NNHeap& nnheap); virtual void getNearestNeighborsFromMinRadius(size_t index, size_t clusterIndex, double minR, double& maxR, NNHeap& nnheap) { std::priority_queue bestR; size_t minNN = (prefetch)?opts->minNNPrefetch:opts->minNNMerge; for (size_t i=0; i. * * ************************************************************************* */ #include "hclust2_result.h" using namespace grup; using namespace Rcpp; HClustResult::HClustResult(size_t n, Distance* dist, bool lite) : curiter(0), n(n), links(n-1, 2), // this may be meaningless for some methods, do not return merge(n-1, 2), height(n-1), order(n, NA_REAL), labels(dist->getLabels()), dist_method(dist->getDistMethod()), lite(lite) { // no-op } void HClustResult::link(size_t i1, size_t i2, double d12) { STOPIFNOT(curiter < n-1); links(curiter, 0) = (double)i1; links(curiter, 1) = (double)i2; height(curiter) = d12; ++curiter; if (curiter == n-1 && !lite) { generateMergeMatrix(); generateOrderVector(); } } void HClustResult::generateOrderVector() { std::vector< std::list > relord(n+1); size_t clusterNumber = 1; for (size_t k=0; k::iterator it = relord[n-1].begin(); it != relord[n-1].end(); ++it) { order[k++] = (*it); } } void HClustResult::generateMergeMatrix() { STOPIFNOT(curiter == n-1); std::vector elements(n+1, 0); std::vector parents(n+1, 0); size_t clusterNumber = 1; for (size_t k=0; k merge(k, 1)) std::swap(merge(k, 0), merge(k, 1)); } } } List HClustResult::toR( const HClustStats& hclustStats, const HClustOptions& hclustOptions, const DistanceStats& distStats) { MESSAGE_2("[%010.3f] generating output matrix\n", clock()/(float)CLOCKS_PER_SEC); List result = List::create( _["merge"] = merge, _["height"] = height, _["order"] = order, _["labels"] = labels, _["call"] = R_NilValue, _["method"] = R_NilValue, _["dist.method"] = dist_method, _["links"] = links, _["stats"] = List::create( _["method"] = hclustStats.toR(), _["distance"] = distStats.toR() ), _["control"] = List::create( _["method"] = hclustOptions.toR() ) ); result.attr("class") = "hclust"; return result; } genie/src/hclust2_mstbased_gini.cpp0000644000176200001440000001615413672036546017133 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #include "hclust2_mstbased_gini.h" #include "hclust2_vptree_single.h" using namespace grup; // constructor (OK, we all know what this is, but I label it for faster in-code search) HClustMSTbasedGini::HClustMSTbasedGini(Distance* dist, HClustOptions* opts) : opts(opts), n(dist->getObjectCount()), #ifdef GENERATE_STATS stats(), #endif distance(dist) { } HClustMSTbasedGini::~HClustMSTbasedGini() { /* pass */ } HclustPriorityQueue HClustMSTbasedGini::getMST() { MESSAGE_2("[%010.3f] determing the MST\n", clock()/(float)CLOCKS_PER_SEC); HclustPriorityQueue out(n); vector todo(n-1); // elements which are still not in the spanning tree for (size_t k=0; k Adist(n, INFINITY); std::vector Afrom(n, SIZE_MAX); size_t lastj = 0; // a randomly chosen element :) for (size_t i=0; i Adist[j]) { bestj = j; bestjpos = k; } #endif } #ifdef _OPENMP // to avoid establishing a (slow!) critical section in // the above loop, this fast part is done single-threadedly for (size_t k=0; k Adist[todo[k]]) { bestj = todo[k]; bestjpos = k; } } #endif out.push(HeapHierarchicalItem(Afrom[bestj], bestj, Adist[bestj])); todo.erase(todo.begin()+bestjpos); // the algorithm is O(n^2) anyway + we need to iterate thru todo sequentially lastj = bestj; if (i % 512 == 0) MESSAGE_7("\r get MST: %d / %d", i, n-1); Rcpp::checkUserInterrupt(); // may throw an exception, fast op, not thread safe } MESSAGE_7("\r get MST: %d / %d \n", n-1, n-1); return out; } void HClustMSTbasedGini::linkAndRecomputeGini(PhatDisjointSets& ds, double& lastGini, size_t s1, size_t s2) { // if opts.thresholdGini == 1.0, there's no need to compute the Gini index if (opts->thresholdGini < 1.0) { double size1 = ds.getClusterSize(s1); double size2 = ds.getClusterSize(s2); lastGini *= (n)*(double)(ds.getClusterCount()-1); std::size_t curi = s1; do { double curs = ds.getClusterSize(curi); lastGini -= std::fabs(curs-size1); lastGini -= std::fabs(curs-size2); lastGini += std::fabs(curs-size1-size2); curi = ds.getClusterNext(curi); } while (curi != s1); lastGini += std::fabs(size2-size1); lastGini -= std::fabs(size2-size1-size2); lastGini -= std::fabs(size1-size1-size2); } s1 = ds.link(s1, s2); if (opts->thresholdGini < 1.0) { lastGini /= (n)*(double)(ds.getClusterCount()-1); lastGini = std::min(1.0, std::max(0.0, lastGini)); // avoid numeric inaccuracies } } HClustResult HClustMSTbasedGini::compute() { HclustPriorityQueue pq; if (opts->useVpTree) { HClustVpTreeSingle hclust(distance, opts); HClustResult res = hclust.compute(/*merge,order not needed*/opts->thresholdGini < 1.0); if (opts->thresholdGini >= 1.0) return res; Rcpp::NumericMatrix links = res.getLinks(); Rcpp::NumericVector dist = res.getHeight(); STOPIFNOT((size_t)dist.size() == n-1); STOPIFNOT((size_t)links.nrow() == n-1); pq = HclustPriorityQueue(n); for (size_t i=0; i pq_cache; while (true) { STOPIFNOT(!pq.empty()) HeapHierarchicalItem hhi = pq.top(); pq.pop(); size_t s1 = ds.find_set(hhi.index1); size_t s2 = ds.find_set(hhi.index2); STOPIFNOT(s1 != s2); if (lastGini > opts->thresholdGini && ds.getClusterSize(s1) > minsize && ds.getClusterSize(s2) > minsize) { // the writelock is still in ON pq_cache.push_back(hhi); continue; } std::size_t lastminsize = minsize; res.link(hhi.index1, hhi.index2, (lastGini <= opts->thresholdGini)?hhi.dist:-hhi.dist); linkAndRecomputeGini(ds, lastGini, s1, s2); if (opts->thresholdGini < 1.0) minsize = ds.getMinClusterSize(); if (++i == n-1) break; if (pq_cache.size() > 0 && (pq.empty() || lastGini <= opts->thresholdGini || minsize != lastminsize)) { if (pq_cache.size() > 5) pq.reset(); // will call make_heap on next top() while (!pq_cache.empty()) { pq.push(pq_cache.back()); pq_cache.pop_back(); } } if (i % 512 == 0) MESSAGE_7("\r merge clusters: %d / %d", i+1, n-1); Rcpp::checkUserInterrupt(); // may throw an exception, fast op, not thread safe } // END WHILE MESSAGE_7("\r merge clusters: %d / %d \n", n-1, n-1); Rcpp::checkUserInterrupt(); return res; } genie/src/defs.h0000644000176200001440000001102313324042165013217 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #ifndef __DEFS_H #define __DEFS_H #include /* SIZE_MAX, C++11 */ #include /* INFINITY, C++11 */ #include /* INFINITY, C++11 */ #ifndef INFINITY #define INFINITY (std::numeric_limits::infinity()) #endif #ifndef SIZE_MAX #define SIZE_MAX (std::numeric_limits::max()) #endif #ifdef _OPENMP #include #endif #ifdef _OPENMP #define MASTER_OR_SINGLE_THREAD (omp_get_thread_num() == 0) #else #define MASTER_OR_SINGLE_THREAD (1) #endif #ifdef _OPENMP #define OPENMP_ONLY(x) {x;} #else #define OPENMP_ONLY(x) ; #endif // --------------------------------------------------------------------- // #define HASHMAP_ENABLED #if 0 /* DEBUG MODE */ #define VERBOSE 1 #define GENERATE_STATS #define DISJOINT_SETS_DEBUG #define MEASURE_MEM_USE #else /* PRODUCTION USE */ #define VERBOSE 0 #endif #define DEFAULT_MAX_LEAVES_ELEMS 4 #define DEFAULT_MAX_NN_PREFETCH 256 #define DEFAULT_MAX_NN_MERGE maxNNPrefetch #define DEFAULT_MIN_NN_PREFETCH 20 #define DEFAULT_MIN_NN_MERGE minNNPrefetch #define DEFAULT_VP_SELECT_SCHEME 3 #define DEFAULT_VP_SELECT_CAND 5 #define DEFAULT_VP_SELECT_TEST 12 #define DEFAULT_NODES_VISITED_LIMIT SIZE_MAX #define DEFAULT_THRESHOLD_GINI 0.3 #define DEFAULT_USEVPTREE false #define DEFAULT_USEMST true // #define DEFAULT_GNAT_DEGREE 50 // #define DEFAULT_GNAT_CANDIDATES_TIMES 3 // #define DEFAULT_GNAT_MIN_DEGREE 2 // #define DEFAULT_GNAT_MAX_DEGREE 200 // #define DEFAULT_GNAT_MAX_TIMES_DEGREE 5 // #define DEFAULT_EXEMPLAR_UPDATE_METHOD 2 // #define DEFAULT_EXEMPLAR_MAX_LEAVES_ELEMS 32 // #define DEFAULT_IS_CURSE_OF_DIMENSIONALITY false // --------------------------------------------------------------------------- #if VERBOSE > 0 #define STOPIFNOT(EXPR) { if (!(EXPR)) Rprintf("\a*** Assert failed: " #EXPR " at %s, line %d ***\n", __FILE__, __LINE__); } #else #define STOPIFNOT(EXPR) { } #endif // --------------------------------------------------------------------------- #if VERBOSE > 0 #define RCOUT(msg, verlvl) if ((verlvl) <= VERBOSE) Rcout << msg << endl; #else #define RCOUT(msg, verlvl) { } #endif /* * example use: * int a = 5; * RCOUT("a=" << a, 0) * */ // --------------------------------------------------------------------- /* TO DO: can we do it more intelligently? */ #if VERBOSE < 1 #define MESSAGE_1(...) { } #else #define MESSAGE_1(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 2 #define MESSAGE_2(...) { } #else #define MESSAGE_2(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 3 #define MESSAGE_3(...) { } #else #define MESSAGE_3(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 4 #define MESSAGE_4(...) { } #else #define MESSAGE_4(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 5 #define MESSAGE_5(...) { } #else #define MESSAGE_5(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 6 #define MESSAGE_6(...) { } #else #define MESSAGE_6(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 7 #define MESSAGE_7(...) { } #else #define MESSAGE_7(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 8 #define MESSAGE_8(...) { } #else #define MESSAGE_8(...) Rprintf(__VA_ARGS__) #endif #if VERBOSE < 9 #define MESSAGE_9(...) { } #else #define MESSAGE_9(...) Rprintf(__VA_ARGS__) #endif #endif genie/src/RcppExports.cpp0000644000176200001440000000203613324042165015126 0ustar liggesusers// Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include using namespace Rcpp; // hclust2_gini RObject hclust2_gini(RObject distance, RObject objects, RObject control); RcppExport SEXP _genie_hclust2_gini(SEXP distanceSEXP, SEXP objectsSEXP, SEXP controlSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< RObject >::type distance(distanceSEXP); Rcpp::traits::input_parameter< RObject >::type objects(objectsSEXP); Rcpp::traits::input_parameter< RObject >::type control(controlSEXP); rcpp_result_gen = Rcpp::wrap(hclust2_gini(distance, objects, control)); return rcpp_result_gen; END_RCPP } static const R_CallMethodDef CallEntries[] = { {"_genie_hclust2_gini", (DL_FUNC) &_genie_hclust2_gini, 3}, {NULL, NULL, 0} }; RcppExport void R_init_genie(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } genie/src/disjoint_sets.cpp0000644000176200001440000001575413324042165015531 0ustar liggesusers/* ************************************************************************* * * This file is part of the `genie` package for R. * * * * Copyright 2015-2018 Marek Gagolewski, Maciej Bartoszuk, Anna Cena * * * * 'genie' is free software: you can redistribute it and/or * * modify it under the terms of the GNU General Public License * * as published by the Free Software Foundation, either version 3 * * of the License, or (at your option) any later version. * * * * 'genie' is distributed in the hope that it will be useful, * * but WITHOUT ANY WARRANTY; without even the implied warranty of * * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * * GNU General Public License for more details. * * * * You should have received a copy of the GNU General Public License * * along with 'genie'. If not, see . * * ************************************************************************* */ #include "disjoint_sets.h" using namespace grup; DisjointSets::DisjointSets(std::size_t _n) : clusterParent(std::vector< std::size_t >(_n)), n(_n) { #ifdef DISJOINT_SETS_DEBUG MESSAGE_1("Warning: DISJOINT_SETS_DEBUG defined in disjoint_sets.h\n"); #endif for (std::size_t i=0; i(_n, 1)), clusterMembers(_n), clusterNext(std::vector< std::size_t >(_n)), clusterPrev(std::vector< std::size_t >(_n)) { clusterCount = _n; minClusterSize = 1; minClusterCount = _n; for (std::size_t i=0; i<_n; ++i) { clusterMembers[i] = (std::size_t*)malloc(sizeof(std::size_t)*1); clusterMembers[i][0] = i; // clusterSize[i] = 1; // already initialized clusterNext[i] = (i<_n-1)?(i+1):0; clusterPrev[i] = (i>0)?(i-1):(_n-1); } } PhatDisjointSets::~PhatDisjointSets() { #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(getClusterCount() == 1); STOPIFNOT(getClusterSize(find_set(0)) == n); STOPIFNOT(getClusterNext(find_set(0)) == find_set(0)); STOPIFNOT(getClusterPrev(find_set(0)) == find_set(0)); #endif for (std::size_t i=0; i 2) { std::size_t oldprev = clusterPrev[y]; std::size_t oldnext = clusterNext[y]; clusterPrev[ oldnext ] = oldprev; clusterNext[ oldprev ] = oldnext; } else { clusterPrev[ z ] = z; clusterNext[ z ] = z; } clusterMembers[z] = (std::size_t*)realloc(clusterMembers[z], (clusterSize[x]+clusterSize[y])*sizeof(std::size_t)); memcpy(clusterMembers[z]+clusterSize[x], clusterMembers[y], clusterSize[y]*sizeof(std::size_t)); free(clusterMembers[y]); clusterMembers[y] = NULL; clusterSize[z] += clusterSize[y]; // clusterMembers[z]->splice(clusterMembers[z]->end(), *(clusterMembers[y])); // O(1) --clusterCount; if (minClusterCount > 0 && sizex == minClusterSize) minClusterCount--; if (minClusterCount > 0 && sizey == minClusterSize) minClusterCount--; if (minClusterCount == 0) recomputeMinClusterSize(); return z; } std::size_t PhatDisjointSets::link(std::size_t x, std::size_t y, std::size_t z) { std::size_t z2 = DisjointSets::link(x, y, z); #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(z == z2); #endif if (clusterCount > 2) { std::size_t oldprev = clusterPrev[y]; std::size_t oldnext = clusterNext[y]; clusterPrev[ oldnext ] = oldprev; clusterNext[ oldprev ] = oldnext; oldprev = clusterPrev[x]; oldnext = clusterNext[x]; clusterPrev[z2] = oldprev; clusterNext[z2] = oldnext; #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(clusterPrev[ oldnext ] == x); STOPIFNOT(clusterNext[ oldprev ] == x); #endif clusterPrev[ oldnext ] = z2; clusterNext[ oldprev ] = z2; } else { clusterPrev[ z2 ] = z2; clusterNext[ z2 ] = z2; } // clusterMembers[x]->splice(clusterMembers[x]->end(), (*clusterMembers[y])); // O(1) // delete clusterMembers[y]; // clusterMembers[y] = NULL; // #ifdef DISJOINT_SETS_DEBUG // if (x != z2 && clusterMembers[z2]) // Rcpp::stop("PhatDisjointSets::link assert failed"); // #endif // std::swap(clusterMembers[z2], clusterMembers[x]); // clusterSize[z2] = clusterSize[x] + clusterSize[y]; clusterMembers[x] = (std::size_t*)realloc(clusterMembers[x], (clusterSize[x]+clusterSize[y])*sizeof(std::size_t)); memcpy(clusterMembers[x]+clusterSize[x], clusterMembers[y], clusterSize[y]*sizeof(std::size_t)); free(clusterMembers[y]); clusterMembers[y] = NULL; #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(x == z2 || !clusterMembers[z2]); #endif std::swap(clusterMembers[z2], clusterMembers[x]); #ifdef DISJOINT_SETS_DEBUG STOPIFNOT(x == z2 || (!clusterMembers[x] && clusterMembers[z2])); #endif clusterSize[z2] = clusterSize[x] + clusterSize[y]; --clusterCount; return z2; } void PhatDisjointSets::recomputeMinClusterSize() { std::size_t start = find_set(0); minClusterSize = getClusterSize(start); minClusterCount = 1; size_t curi = getClusterNext(start); while (curi != start) { std::size_t curSize = getClusterSize(curi); if (curSize == minClusterSize) minClusterCount++; else if (curSize < minClusterSize) { minClusterSize = curSize; minClusterCount = 1; } curi = getClusterNext(curi); } } genie/NEWS0000644000176200001440000000216513711501062012040 0ustar liggesusers genie package NEWS and CHANGELOG =============================================================================== ## 1.0.5 (2020-08-02) * Updated documentation and package metadata. * This package has been superseded by `genieclust`, which is faster and more feature-rich (and also available for Python). ## 1.0.4 (2017-04-27) * Invalid DOI corrected. ## 1.0.3 (2017-04-27) * [BUILD TIME] Registering native routines and disabling symbol search. ## 1.0.1 (2016-05-25) * Updated documentation and package metadata. The algorithm's description can now be found in: Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, pp. 8-23, doi:10.1016/j.ins.2016.05.003 See also: Gagolewski M., Cena A., Bartoszuk M., Hierarchical clustering via penalty-based aggregation and the Genie approach, In: Torra V. et al. (Eds.), Modeling Decisions for Artificial Intelligence (Lecture Notes in Artificial Intelligence 9880), Springer, 2016, pp. 191-202, doi:10.1007/978-3-319-45656-0_16. ## 1.0.0 (2016-03-07) * Initial release. genie/R/0000755000176200001440000000000013711501666011550 5ustar liggesusersgenie/R/hclust2.R0000644000176200001440000001402713711501356013257 0ustar liggesusers#' @title #' Fast Hierarchical Clustering in Spaces Equipped With #' a Dissimilarity Measure #' #' @description #' The reference implementation of the fast, robust and outlier resistant #' Genie algorithm described in (Gagolewski, Bartoszuk, Cena, 2016). #' Note that the \code{genie} package has been superseded by \code{genieclust}, #' see \code{\link[genieclust]{gclust}} and \code{\link[genieclust]{genie}} #' for more details. #' #' @param d an object of class \code{\link[stats]{dist}}, #' \code{NULL}, or a single string, see below #' @param objects \code{NULL}, numeric matrix, a list, or a character vector #' @param thresholdGini single numeric value in [0,1], #' threshold for the Gini index, 1 gives the standard single linkage algorithm #' @param useVpTree single logical value, whether to use a vantage-point tree #' to speed up nearest neighbour searching in low-dimensional spaces #' @param ... internal parameters used to tune up the algorithm #' #' @details #' The time needed to apply a hierarchical clustering algorithm #' is most often dominated by the number of computations of a pairwise #' dissimilarity measure. Such a constraint, for larger data sets, #' puts at a disadvantage the use of all the classical linkage #' criteria but the single linkage one. However, it is known that the single #' linkage clustering algorithm is very sensitive to outliers, produces highly #' skewed dendrograms, and therefore usually does not reflect the true #' underlying data structure -- unless the clusters are well-separated. #' #' To overcome its limitations, in (Gagolewski, Bartoszuk, Cena, 2016) #' we proposed a new hierarchical clustering linkage #' criterion. Namely, our algorithm links two clusters in such a way that a chosen #' economic inequity measure (here, the Gini index) of the cluster #' sizes does not increase drastically above a given threshold. The #' benchmarks indicate a high practical usefulness of the introduced method: #' it most often outperforms the Ward or average linkage in terms of the #' clustering quality while retaining the single linkage speed. #' The algorithm can be run in parallel (via OpenMP) on multiple threads #' to speed up its execution further on. #' Its memory overhead is small: there is no need to precompute the complete #' distance matrix to perform the computations in order to obtain a desired #' clustering. #' #' For compatibility with \code{\link[stats]{hclust}}, \code{d} may be an object #' of class \code{\link[stats]{dist}}. In such a case, the \code{objects} #' argument is ignored. Note that such an object requires ca. \emph{8n(n-1)/2} #' bytes of computer's memory, where \emph{n} is the number of objects to cluster, #' and therefore this setting can be used to analyse data sets of sizes #' up to about 10,000-50,000. #' #' If \code{objects} is a character vector or a list, then \code{d} #' should be a single string, one of: \code{levenshtein} (or \code{NULL}), #' \code{hamming}, \code{dinu} (Dinu, Sgarro, 2006), #' or \code{euclinf} (Cena et al., 2015). #' Note that the list must consist #' either of integer or of numeric vectors only (depending on the dissimilarity #' measure of choice). On the other hand, each string must be in ASCII, #' but you can always convert it to UTF-32 with #' \code{\link[stringi]{stri_enc_toutf32}}. #' #' Otherwise, if \code{objects} is a numeric matrix (here, each row #' denotes a distinct observation), then \code{d} should be #' a single string, one of: \code{euclidean_squared} (or \code{NULL}), #' \code{euclidean} (which yields the same results as \code{euclidean_squared}) #' \code{manhattan}, \code{maximum}, or \code{hamming}. #' #' If \code{useVpTree} is \code{FALSE}, then the dissimilarity measure #' of choice is guaranteed to be computed for each unique pair of \code{objects} #' only once. #' #' @return #' A named list of class \code{hclust}, see \code{\link[stats]{hclust}}, #' with additional components: #' \itemize{ #' \item \code{stats} - performance statistics #' \item \code{control} - internal parameters used #' } #' #' @examples #' library("datasets") #' data("iris") #' h <- hclust2(objects=as.matrix(iris[,2:3]), thresholdGini=0.2) #' plot(iris[,2], iris[,3], col=cutree(h, 3), pch=as.integer(iris[,5]), asp=1, las=1) #' #' @references #' Cena A., Gagolewski M., Mesiar R., Problems and challenges of information #' resources producers' clustering, \emph{Journal of Informetrics} 9(2), 2015, #' pp. 273-284. #' #' Dinu L.P., Sgarro A., A Low-complexity Distance for DNA Strings, #' \emph{Fundamenta Informaticae} 73(3), 2006, pp. 361-372. #' #' Gagolewski M., Bartoszuk M., Cena A., #' Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, #' \emph{Information Sciences} 363, 2016, pp. 8-23. #' #' Gagolewski M., Cena A., Bartoszuk M. #' \emph{Hierarchical clustering via penalty-based aggregation and the Genie #' approach}, In: Torra V. et al. (Eds.), \emph{Modeling Decisions for #' Artificial Intelligence} (\emph{Lecture Notes in Artificial Intelligence} #' 9880), Springer, 2016. #' #' @importFrom stats approx #' @importFrom genieclust gclust #' @importFrom genieclust genie #' @export hclust2 <- function(d=NULL, objects=NULL, thresholdGini=0.3, useVpTree=FALSE, ...) { opts <- list(thresholdGini=thresholdGini, useVpTree=useVpTree, ...) result <- .hclust2_gini(d, objects, opts) result[["call"]] <- match.call() result[["method"]] <- "gini" if (any(result[["height"]]<0)) { # corrections for departures from ultrametricity # negative heights denote force Genie merges # we could just use have used cummax, but then we'd get multiple # merges at the same level; instead we'll linearly interpolate # between the points nonNegative <- which(result[["height"]]>=0) lastNonNegative <- nonNegative[length(nonNegative)] result[["height"]][1:lastNonNegative] <- approx(nonNegative, # linear interpolation result[["height"]][nonNegative], 1:lastNonNegative)$y result[["height"]][result[["height"]] < 0] <- cummax(-result[["height"]][result[["height"]] < 0]) } result } genie/R/RcppExports.R0000644000176200001440000000037613711501666014172 0ustar liggesusers# Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 .hclust2_gini <- function(distance, objects, control = NULL) { .Call(`_genie_hclust2_gini`, distance, objects, control) } genie/R/genie-package.R0000644000176200001440000000050713706444020014347 0ustar liggesusers#' @title The Genie Package #' #' @description #' See \code{\link{hclust2}()} for details. #' #' #' @name genie-package #' @rdname genie-package #' @aliases genie #' @docType package #' @author Marek Gagolewski, Maciej Bartoszuk, Anna Cena #' #' @useDynLib genie, .registration=TRUE #' @importFrom Rcpp evalCpp invisible(NULL) genie/MD50000644000176200001440000000340613711633542011661 0ustar liggesusers42663de693e865e7d0359c0611c5074c *DESCRIPTION 0cce106eda8a0102d80da3495233fff5 *NAMESPACE b74c80fc9c4aa3d5d524e4050ccf5812 *NEWS 1d29001a8b2960673856de5c93b19fc8 *R/RcppExports.R b67df35b878a09622912e1d4257d02f4 *R/genie-package.R 676112538af4b9f23f29eb34b944cac8 *R/hclust2.R dabba701ca604243b3ff0a8f445f0c4e *inst/CITATION 823b6d1e4cdb4eeab91127f8055e6052 *man/genie-package.Rd 6d608dc47d2afecbcf787557af0e549d *man/hclust2.Rd d75f89b94a6d7b50ade66d8acfb955f2 *src/Makevars d75f89b94a6d7b50ade66d8acfb955f2 *src/Makevars.win 19595723ebd770cb008437d80d8cf603 *src/RcppExports.cpp 2925a32b678e40604b768a715a1b2455 *src/defs.h 23c393db11c830fe05d0698856cf6687 *src/disjoint_sets.cpp b14644229bf9a2b90ba7b245c6399fd6 *src/disjoint_sets.h 2ca9c30a531dc4d8c9fbd557a6fecbc6 *src/hclust2_common.cpp ddf7842901942ddb16bff5c4339d9ba3 *src/hclust2_common.h 88f82667150c32870d846b3042b06f0e *src/hclust2_distance.cpp 957ad5898889dc1c55b1dd7a150b7d2c *src/hclust2_distance.h 773d5732fa8ad1fd66a6dcadaa086f4e *src/hclust2_mstbased_gini.cpp 2b9fe7f10fefe66ee37cd2d784204533 *src/hclust2_mstbased_gini.h 9dbab5f188885666c8489d169b64e0ed *src/hclust2_nnbased_gini.h b919f4e167d7d5b17372e4924b47a943 *src/hclust2_nnbased_single.cpp 01bf361ff1d8931b6bd1c22b4bc584d7 *src/hclust2_nnbased_single.h d1d4fc221e82d968d2721e6c9b8a8ec5 *src/hclust2_rcpp_gini.cpp 6a68439aa9f73384d5b922ad1047b997 *src/hclust2_result.cpp ef839ef7e7408b73ff91774efb1b5ffd *src/hclust2_result.h 163a032a3ac62850bbb72dfe930a7fd5 *src/hclust2_vptree_gini.h d7dba4d9802a3c40390cdcc30c4962dc *src/hclust2_vptree_single.cpp 9463bd5793842b9cfc79330931c7aebe *src/hclust2_vptree_single.h f42c9bc4435bb983274895a0f4fc9255 *src/init.cpp ac0b48e83e47e2efe2d9e15b791155ac *tests/testthat.R 5a2370f089507fe281c93e3c512a0314 *tests/testthat/test-single.R genie/inst/0000755000176200001440000000000013241042746012321 5ustar liggesusersgenie/inst/CITATION0000644000176200001440000000131213241042746013453 0ustar liggesusersbibentry( bibtype = "article", title = "{Genie}: {A} new, fast, and outlier-resistant hierarchical clustering algorithm", author = personList( as.person("Marek Gagolewski"), as.person("Maciej Bartoszuk"), as.person("Anna Cena") ), journal = "Information Sciences", year = "2016", volume = "363", pages = "8--23", doi = "10.1016/j.ins.2016.05.003", textVersion = paste( "Gagolewski M., Bartoszuk M., Cena A.", "Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm.", "Information Sciences 363, 2016, pp. 8-23, doi:10.1016/j.ins.2016.05.003." ) )