knn.covertree/0000755000176200001440000000000013555610202013034 5ustar liggesusersknn.covertree/NAMESPACE0000644000176200001440000000045213540637510014261 0ustar liggesusers# Generated by roxygen2: do not edit by hand export(find_knn) importFrom(Matrix,forceSymmetric) importFrom(Matrix,skewpart) importFrom(Matrix,symmpart) importFrom(Rcpp,evalCpp) importFrom(RcppEigen,RcppEigen.package.skeleton) importFrom(methods,as) importFrom(methods,is) useDynLib(knn.covertree) knn.covertree/README.md0000644000176200001440000000371213540637125014325 0ustar liggesusersknn.covertree [![Workflow badge][]](https://github.com/flying-sheep/knn.covertree/commits/master) ============= [Workflow badge]: https://github.com/flying-sheep/knn.covertree/workflows/Build%20R%20package/badge.svg A package for precise approximative nearest neighbor search in more than just euclidean space. Its only exported function `find_knn` computes the `k` nearest neighbors of the rows of the `query` matrix in the `data` matrix. If no `query` matrix is passed, the nearest neighbors for all rows in the data will be returned (i.e. `data` will be used as `query`). [generic]: https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgCMatrix-class.html [symmetric]: https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dsCMatrix-class.html ```r find_knn( data, k, ..., query = NULL, distance = c("euclidean", "cosine", "rankcor"), sym = TRUE) ``` The result will be a list containing - `index`, a `nrow(query)` × `k` integer matrix containing the row indices into `data` that are the nearest neighbors. - `dist`, a `nrow(query)` × `k` double matrix containing the `distance`s to those neighbors. - `dist_mat`, a `nrow(query)` × `nrow(data)` a `Matrix::dSparseMatrix`, [generic][] if `!sym` or `!is.null(query)`, and [symmetric][] if `sym` and `is.null(query)`. Zeros in this matrix mean “not a knn”, and if `sym` is set, the matrix will be post processed to be symmetric. (Without post processing, the matrix will likely be asymmetric as `r1∈kNN(r2)` does not imply `r2∈knn(r1)`) This package was separated from [destiny][] as it might prove helpful in other contexts. It provides more distance metrics than [FNN][] and is more precise than [RcppHNSW][], but slower than both. If anyone knows a faster and similarly precise kNN search in cosine (=rank correlation) space, please tell me! [destiny]: http://bioconductor.org/packages/destiny/ [FNN]: https://CRAN.R-project.org/package=FNN [RcppHNSW]: https://github.com/jlmelville/rcpphnsw knn.covertree/man/0000755000176200001440000000000013522503623013611 5ustar liggesusersknn.covertree/man/knn.Rd0000644000176200001440000000410613540407626014675 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/knn.r \name{find_knn} \alias{find_knn} \title{kNN search} \usage{ find_knn(data, k, ..., query = NULL, distance = c("euclidean", "cosine", "rankcor"), sym = TRUE) } \arguments{ \item{data}{Data matrix} \item{k}{Number of nearest neighbors} \item{...}{Unused. All parameters to the right of the \code{...} have to be specified by name (e.g. \code{find_knn(data, k, distance = 'cosine')})} \item{query}{Query matrix. In \code{knn} and \code{knn_asym}, query and data are identical} \item{distance}{Distance metric to use. Allowed measures: Euclidean distance (default), cosine distance (\eqn{1-corr(c_1, c_2)}) or rank correlation distance (\eqn{1-corr(rank(c_1), rank(c_2))})} \item{sym}{Return a symmetric matrix (as long as query is NULL)?} } \value{ A \code{\link{list}} with the entries: \describe{ \item{\code{index}}{A \eqn{nrow(data) \times k} \link{integer} \link{matrix} containing the indices of the k nearest neighbors for each cell.} \item{\code{dist}}{A \eqn{nrow(data) \times k} \link{double} \link{matrix} containing the distances to the k nearest neighbors for each cell.} \item{\code{dist_mat}}{ A \code{\link[Matrix:dgCMatrix-class]{dgCMatrix}} if \code{sym == TRUE}, else a \code{\link[Matrix:dsCMatrix-class]{dsCMatrix}} (\eqn{nrow(query) \times nrow(data)}). Any zero in the matrix (except for the diagonal) indicates that the cells in the corresponding pair are close neighbors. } } } \description{ k nearest neighbor search with custom distance function. } \examples{ # The default: symmetricised pairwise distances between all rows pairwise <- find_knn(mtcars, 5L) image(as.matrix(pairwise$dist_mat)) # Nearest neighbors of a subset within all mercedeses <- grepl('Merc', rownames(mtcars)) merc_vs_all <- find_knn(mtcars, 5L, query = mtcars[mercedeses, ]) # Replace row index matrix with row name matrix matrix( rownames(mtcars)[merc_vs_all$index], nrow(merc_vs_all$index), dimnames = list(rownames(merc_vs_all$index), NULL) )[, -1] # 1st nearest neighbor is always the same row } knn.covertree/man/knn.covertree-package.Rd0000644000176200001440000000060313522503623020253 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/knn.covertree-package.r \docType{package} \name{knn.covertree} \alias{knn.covertree} \alias{knn.covertree-package} \title{A not-too-fast but accurate kNN implementation supporting multiple distance measures} \description{ A not-too-fast but accurate kNN implementation supporting multiple distance measures } knn.covertree/DESCRIPTION0000644000176200001440000000235413555610202014546 0ustar liggesusersPackage: knn.covertree Type: Package Title: An Accurate kNN Implementation with Multiple Distance Measures Version: 1.0 Date: 2019-10-24 Authors@R: c( person('Philipp', 'Angerer', NULL, 'philipp.angerer@helmholtz-muenchen.de', c('cre', 'aut'), comment = c(ORCID = '0000-0002-0369-2888')), person('David', 'Crane', NULL, 'dncrane@gmail.com', c('cph', 'aut'))) Maintainer: Philipp Angerer Description: Similarly to the 'FNN' package, this package allows calculation of the k nearest neighbors (kNN) of a data matrix. The implementation is based on cover trees introduced by Alina Beygelzimer, Sham Kakade, and John Langford (2006) . URL: https://github.com/flying-sheep/knn.covertree BugReports: https://github.com/flying-sheep/knn.covertree/issues License: AGPL-3 Imports: Rcpp (>= 1.0.2), RcppEigen (>= 0.3.3.5.0), Matrix, methods Suggests: testthat, FNN LinkingTo: Rcpp, RcppEigen SystemRequirements: C++11 NeedsCompilation: yes Encoding: UTF-8 RoxygenNote: 6.1.1 Packaged: 2019-10-24 08:59:51 UTC; angerer Author: Philipp Angerer [cre, aut] (), David Crane [cph, aut] Repository: CRAN Date/Publication: 2019-10-28 16:00:02 UTC knn.covertree/tests/0000755000176200001440000000000013540362041014175 5ustar liggesusersknn.covertree/tests/testthat/0000755000176200001440000000000013555610202016036 5ustar liggesusersknn.covertree/tests/testthat/test-knn.r0000644000176200001440000000135613540424576020004 0ustar liggesuserscontext('kNN integrity') test_that('knn works the same as FNN', { skip_if_not_installed('FNN') e <- as.matrix(datasets::mtcars) r_destiny <- find_knn(e, 5L) r_fnn <- FNN::get.knn(e, 5L) dimnames(r_destiny$index) <- dimnames(r_destiny$dist) <- NULL expect_equal(r_destiny$dist, r_fnn$nn.dist) }) test_that('knnx works the same as to FNN', { skip_if_not_installed('FNN') e <- as.matrix(datasets::mtcars) am <- factor(e[, 'am'], labels = c('automatic', 'manual')) r_destiny <- find_knn(e[am == 'manual', ], 5L, query = e[am == 'automatic', ]) r_fnn <- FNN::get.knnx(e[am == 'manual', ], e[am == 'automatic', ], 5L) dimnames(r_destiny$index) <- dimnames(r_destiny$dist) <- NULL expect_equal(r_destiny$dist, r_fnn$nn.dist) }) knn.covertree/tests/testthat.R0000644000176200001440000000010613540362130016154 0ustar liggesuserslibrary(testthat) library(knn.covertree) test_check('knn.covertree') knn.covertree/src/0000755000176200001440000000000013554264007013632 5ustar liggesusersknn.covertree/src/exports.h0000644000176200001440000000056513522503623015510 0ustar liggesusers#ifndef _DESTINY_EXPORTS_H #define _DESTINY_EXPORTS_H #include #include using namespace Rcpp; List knn_cross(const NumericMatrix data, const NumericMatrix query, const size_t k, const std::string distance); List knn_asym(const NumericMatrix data, const size_t k, const std::string distance); NumericMatrix rank_mat(const NumericMatrix x); #endif knn.covertree/src/cover_tree.h0000644000176200001440000005162213522503623016141 0ustar liggesusers/* https://github.com/DNCrane/Cover-Tree/commit/b6de5de9cafcc496f640dd8ddc63e061d0226519 * Copyright (C) 2011 by Singularity Institute for Artificial Intelligence * All Rights Reserved * * Written by David Crane * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License v3 as * published by the Free Software Foundation and including the exceptions * at http://opencog.org/wiki/Licenses * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU Affero General Public License * along with this program; if not, write to: * Free Software Foundation, Inc., * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. */ #ifndef _COVER_TREE_H #define _COVER_TREE_H #include #include #include #include #include #include #include /** * Cover Tree. Allows for insertion, removal, and k-nearest-neighbor * queries. * * The user should define double Point::distance(const Point& p) and * bool Point::operator==(const Point& p), where * p1.distance(p2)==0 doesn't necessarily mean that p1==p2). * * For example, a point could consist of a vector and a string * name, where their distance measure is simply euclidean distance but to be * equal they must have the same name in addition to having distance 0. */ template class CoverTree { /** * Cover tree node. Consists of arbitrarily many points P, as long as * they have distance 0 to each other. Keeps track of its children. */ class CoverTreeNode { private: //_childMap[i] is a vector of the node's children at level i std::map > _childMap; //_points is all of the points with distance 0 which are not equal. std::vector _points; public: CoverTreeNode(const Point& p); /** * Returns the children of the node at level i. Note that this means * the children exist in cover set i-1, not level i. * * Does not include the node itself, though technically every node * has itself as a child in a cover tree. */ std::vector getChildren(int level) const; void addChild(int level, CoverTreeNode* p); void removeChild(int level, CoverTreeNode* p); void addPoint(const Point& p); void removePoint(const Point& p); const std::vector& getPoints() { return _points; } double distance(const CoverTreeNode& p) const; bool isSingle() const; bool hasPoint(const Point& p) const; const Point& getPoint() const; /** * Return every child of the node from any level. This is handy for * the destructor. */ std::vector getAllChildren() const; }; // CoverTreeNode class private: typedef std::pair distNodePair; CoverTreeNode* _root; unsigned int _numNodes; int _maxLevel;//base^_maxLevel should be the max distance //between any 2 points int _minLevel;//A level beneath which there are no more new nodes. std::set kNearestNodeSet(const Point& p, const unsigned int& k) const; std::vector kNearestNodes(const Point& p, const unsigned int& k) const; /** * Recursive implementation of the insert algorithm (see paper). */ bool insert_rec(const Point& p, const std::vector& Qi, const int& level); /** * Finds the node in Q with the minimum distance to p. Returns a * pair consisting of this node and the distance. */ distNodePair distance(const Point& p, const std::vector& Q); void remove_rec(const Point& p, std::map >& coverSets, int level, bool& multi); public: static constexpr double base = 2.0; /** * Constructs a cover tree which begins with all points in points. * * maxDist should be the maximum distance that any two points * can have between each other. IE p.distance(q) < maxDist for all * p,q that you will ever try to insert. The cover tree may be invalid * if an inaccurate maxDist is given. */ CoverTree(const double& maxDist, const std::vector& points=std::vector()); ~CoverTree(); /** * Just for testing/debugging. Returns true iff the cover tree satisfies the * the covering tree invariants (every node in level i is greater than base^i * distance from every other node, and every node in level i is less than * or equal to base^i distance from its children). See the cover tree * papers for details. */ bool isValidTree() const; /** * Insert newPoint into the cover tree. If newPoint is already present, * (that is, newPoint==p for some p already in the tree), then the tree * is unchanged. If p.distance(newPoint)==0.0 but newPoint!=p, then * newPoint WILL be inserted and both points may be returned in k-nearest- * neighbor searches. */ void insert(const Point& newPoint); /** * Remove point p from the cover tree. If p is not present in the tree, * it will remain unchanged. Otherwise, this will remove exactly one * point q from the tree satisfying p==q. */ void remove(const Point& p); std::vector> kNearestNeighborDists(const Point& p, const unsigned int& k) const; /** * Returns the k nearest points to p in order (the 0th element of the vector * is closest to p, 1th is next, etc). It may return greater than k points * if there is a tie for the kth place. */ std::vector kNearestNeighbors(const Point& p, const unsigned int& k) const; CoverTreeNode* getRoot() const; /** * Print the cover tree. */ void print() const; }; // CoverTree class template CoverTree::CoverTree(const double& maxDist, const std::vector& points) { _root=NULL; _numNodes=0; _maxLevel=ceilf(log(maxDist)/log(base)); _minLevel=_maxLevel-1; typename std::vector::const_iterator it; for(it=points.begin(); it!=points.end(); ++it) { this->insert(*it); } } template CoverTree::~CoverTree() { if(_root==NULL) return; //Get all of the root's children (from any level), //delete the root, repeat for each of the children std::vector nodes; nodes.push_back(_root); while(!nodes.empty()) { CoverTreeNode* byeNode = nodes[0]; nodes.erase(nodes.begin()); std::vector children = byeNode->getAllChildren(); nodes.insert(nodes.begin(),children.begin(),children.end()); //std::cout << _numNodes << "\n"; delete byeNode; //_numNodes--; } } template std::set::CoverTreeNode*>> CoverTree::kNearestNodeSet(const Point& p, const unsigned int& k) const { if(_root==NULL) return std::set(); //maxDist is the kth nearest known point to p, and also the farthest //point from p in the set minNodes defined below. double maxDist = p.distance(_root->getPoint()); //minNodes stores the k nearest known points to p. std::set minNodes; minNodes.insert(std::make_pair(maxDist,_root)); std::vector Qj(1,std::make_pair(maxDist,_root)); for(int level = _maxLevel; level>=_minLevel;level--) { typename std::vector::const_iterator it; int size = Qj.size(); for(int i=0; i children = Qj[i].second->getChildren(level); typename std::vector::const_iterator it2; for(it2=children.begin(); it2!=children.end(); ++it2) { double d = p.distance((*it2)->getPoint()); if(d < maxDist || minNodes.size() < k) { minNodes.insert(std::make_pair(d,*it2)); //--minNodes.end() gives us an iterator to the greatest //element of minNodes. if(minNodes.size() > k) minNodes.erase(--minNodes.end()); maxDist = (--minNodes.end())->first; } Qj.push_back(std::make_pair(d,*it2)); } } double sep = maxDist + pow(base, level); size = Qj.size(); for(int i=0; i sep) { //quickly removes an element from a vector w/o preserving order. Qj[i]=Qj.back(); Qj.pop_back(); size--; i--; } } } return minNodes; } template std::vector::CoverTreeNode*> CoverTree::kNearestNodes(const Point& p, const unsigned int& k) const { if(_root==NULL) return std::vector(); std::vector kNN; for (distNodePair const& dist_node : kNearestNodeSet(p, k)) { kNN.push_back(dist_node.second); } return kNN; } template bool CoverTree::insert_rec(const Point& p, const std::vector& Qi, const int& level) { std::vector > Qj; double sep = pow(base,level); double minDist = DBL_MAX; std::pair minQiDist(DBL_MAX,NULL); typename std::vector >::const_iterator it; for(it=Qi.begin(); it!=Qi.end(); ++it) { if(it->firstfirstfirst; if(it->first<=sep) Qj.push_back(*it); std::vector children = it->second->getChildren(level); typename std::vector::const_iterator it2; for(it2=children.begin();it2!=children.end();++it2) { double d = p.distance((*it2)->getPoint()); if(d sep) { return true; } else { bool found = insert_rec(p,Qj,level-1); //distNodePair minQiDist = distance(p,Qi); if(found && minQiDist.first <= sep) { if(level-1<_minLevel) _minLevel=level-1; minQiDist.second->addChild(level, new CoverTreeNode(p)); //std::cout << "parent is "; //minQiDist.second->getPoint().print(); _numNodes++; return false; } else { return found; } } } template void CoverTree::remove_rec(const Point& p, std::map >& coverSets, int level, bool& multi) { std::vector& Qi = coverSets[level]; std::vector& Qj = coverSets[level-1]; double minDist = DBL_MAX; CoverTreeNode* minNode = _root; CoverTreeNode* parent = 0; double sep = pow(base, level); typename std::vector::const_iterator it; //set Qj to be all children q of Qi such that p.distance(q)<=sep //and also keep track of the minimum distance from p to a node in Qj //note that every node has itself as a child, but the //getChildren function only returns non-self-children. for(it=Qi.begin();it!=Qi.end();++it) { std::vector children = it->second->getChildren(level); double dist = it->first; if(distsecond; } if(dist <= sep) { Qj.push_back(*it); } typename std::vector::const_iterator it2; for(it2=children.begin();it2!=children.end();++it2) { dist = p.distance((*it2)->getPoint()); if(distsecond; } if(dist <= sep) { Qj.push_back(std::make_pair(dist,*it2)); } } } if(level>_minLevel) remove_rec(p,coverSets,level-1,multi); if(minNode->hasPoint(p)) { //the multi flag indicates the point we removed is from a //node containing multiple points, and we have removed it, //so we don't need to do anything else. if(multi) return; if(!minNode->isSingle()) { minNode->removePoint(p); multi=true; return; } if(parent!=NULL) parent->removeChild(level, minNode); std::vector children = minNode->getChildren(level-1); std::vector& Q = coverSets[level-1]; if(Q.size()==1 && Q[0].second==minNode) { Q.pop_back(); } else { for(unsigned int i=0;i::const_iterator it; for(it=children.begin();it!=children.end();++it) { int i = level-1; Point q = (*it)->getPoint(); double minDQ = DBL_MAX; CoverTreeNode* minDQNode; double sep = pow(base,i); bool br=false; while(true) { std::vector& Q = coverSets[i]; typename std::vector::const_iterator it2; minDQ = DBL_MAX; for(it2=Q.begin();it2!=Q.end();++it2) { double d = q.distance(it2->second->getPoint()); if(dsecond; if(d <=sep) { br=true; break; } } } minDQ=DBL_MAX; if(br) break; Q.push_back(std::make_pair((*it)->distance(p),*it)); i++; sep = pow(base,i); } //minDQNode->getPoint().print(); //std::cout << " is level " << i << " parent of "; //(*it)->getPoint().print(); minDQNode->addChild(i,*it); } if(parent!=NULL) { delete minNode; _numNodes--; } } } template std::pair::CoverTreeNode*> CoverTree::distance(const Point& p, const std::vector& Q) { double minDist = DBL_MAX; CoverTreeNode* minNode; typename std::vector::const_iterator it; for(it=Q.begin();it!=Q.end();++it) { double dist = p.distance((*it)->getPoint()); if(dist < minDist) { minDist = dist; minNode = *it; } } return std::make_pair(minDist,minNode); } template void CoverTree::insert(const Point& newPoint) { if(_root==NULL) { _root = new CoverTreeNode(newPoint); _numNodes=1; return; } //TODO: this is pretty inefficient, there may be a better way //to check if the node already exists... CoverTreeNode* n = kNearestNodes(newPoint,1)[0]; if(newPoint.distance(n->getPoint())==0.0) { n->addPoint(newPoint); } else { //insert_rec acts under the assumption that there are no nodes with //distance 0 to newPoint in the cover tree (the previous lines check it) insert_rec(newPoint, std::vector (1,std::make_pair(_root->distance(newPoint),_root)), _maxLevel); } } template void CoverTree::remove(const Point& p) { //Most of this function's code is for the special case of removing the root if(_root==NULL) return; bool removingRoot=_root->hasPoint(p); if(removingRoot && !_root->isSingle()) { _root->removePoint(p); return; } CoverTreeNode* newRoot=NULL; if(removingRoot) { if(_numNodes==1) { //removing the last node... delete _root; _numNodes--; _root=NULL; return; } else { for(int i=_maxLevel;i>_minLevel;i--) { if(!(_root->getChildren(i).empty())) { newRoot = _root->getChildren(i).back(); _root->removeChild(i,newRoot); break; } } } } std::map > coverSets; coverSets[_maxLevel].push_back(std::make_pair(_root->distance(p),_root)); if(removingRoot) coverSets[_maxLevel].push_back(std::make_pair(newRoot->distance(p),newRoot)); bool multi = false; remove_rec(p,coverSets,_maxLevel,multi); if(removingRoot) { delete _root; _numNodes--; _root=newRoot; } } template std::vector CoverTree::kNearestNeighbors(const Point& p, const unsigned int& k) const { if(_root==NULL) return std::vector(); std::vector v = kNearestNodes(p, k); std::vector kNN; typename std::vector::const_iterator it; for(it=v.begin();it!=v.end();++it) { const std::vector& p = (*it)->getPoints(); kNN.insert(kNN.end(),p.begin(),p.end()); if(kNN.size() >= k) break; } return kNN; } template std::vector> CoverTree::kNearestNeighborDists(const Point& p, const unsigned int& k) const { std::vector> kNN; for (distNodePair const& dist_node : kNearestNodeSet(p, k)) { //Rcpp::Rcout << "dist:" << dist_node.first << ", point:" << dist_node.second->getPoint() << " "; kNN.push_back(std::make_pair(dist_node.first, dist_node.second->getPoint())); } //Rcpp::Rcout << std::endl; return kNN; } template void CoverTree::print() const { int d = _maxLevel-_minLevel+1; std::vector Q; Q.push_back(_root); for(int i=0;i::const_iterator it; for(it=Q.begin();it!=Q.end();++it) { (*it)->getPoint().print(); std::vector children = (*it)->getChildren(_maxLevel-i); typename std::vector::const_iterator it2; for(it2=children.begin();it2!=children.end();++it2) { std::cout << " "; (*it2)->getPoint().print(); } } std::vector newQ; for(it=Q.begin();it!=Q.end();++it) { std::vector children = (*it)->getChildren(_maxLevel-i); newQ.insert(newQ.end(),children.begin(),children.end()); } Q.insert(Q.end(),newQ.begin(),newQ.end()); std::cout << "\n\n"; } } template typename CoverTree::CoverTreeNode* CoverTree::getRoot() const { return _root; } template CoverTree::CoverTreeNode::CoverTreeNode(const Point& p) { _points.push_back(p); } template std::vector::CoverTreeNode*> CoverTree::CoverTreeNode::getChildren(int level) const { typename std::map >::const_iterator it = _childMap.find(level); if(it!=_childMap.end()) { return it->second; } return std::vector(); } template void CoverTree::CoverTreeNode::addChild(int level, CoverTreeNode* p) { _childMap[level].push_back(p); } template void CoverTree::CoverTreeNode::removeChild(int level, CoverTreeNode* p) { std::vector& v = _childMap[level]; for(unsigned int i=0;i void CoverTree::CoverTreeNode::addPoint(const Point& p) { if(find(_points.begin(), _points.end(), p) == _points.end()) _points.push_back(p); } template void CoverTree::CoverTreeNode::removePoint(const Point& p) { typename std::vector::iterator it = find(_points.begin(), _points.end(), p); if(it != _points.end()) _points.erase(it); } template double CoverTree::CoverTreeNode::distance(const CoverTreeNode& p) const { return _points[0].distance(p.getPoint()); } template bool CoverTree::CoverTreeNode::isSingle() const { return _points.size() == 1; } template bool CoverTree::CoverTreeNode::hasPoint(const Point& p) const { return find(_points.begin(), _points.end(), p) != _points.end(); } template const Point& CoverTree::CoverTreeNode::getPoint() const { return _points[0]; } template std::vector::CoverTreeNode*> CoverTree::CoverTreeNode::getAllChildren() const { std::vector children; typename std::map >::const_iterator it; for(it=_childMap.begin();it!=_childMap.end();++it) { children.insert(children.end(), it->second.begin(), it->second.end()); } return children; } template bool CoverTree::isValidTree() const { if(_numNodes==0) return _root==NULL; std::vector nodes; nodes.push_back(_root); for(int i=_maxLevel;i>_minLevel;i--) { double sep = pow(base,i); typename std::vector::const_iterator it, it2; //verify separation invariant of cover tree: for each level, //every point is farther than base^level away for(it=nodes.begin(); it!=nodes.end(); ++it) { for(it2=nodes.begin(); it2!=nodes.end(); ++it2) { double dist=(*it)->distance((*it2)->getPoint()); if(dist<=sep && dist!=0.0) { std::cout << "Level " << i << " Separation invariant failed.\n"; return false; } } } std::vector allChildren; for(it=nodes.begin(); it!=nodes.end(); ++it) { std::vector children = (*it)->getChildren(i); //verify covering tree invariant: the children of node n at level //i are no further than base^i away for(it2=children.begin(); it2!=children.end(); ++it2) { double dist = (*it2)->distance((*it)->getPoint()); if(dist>sep) { std::cout << "Level" << i << " covering tree invariant failed.n"; return false; } } allChildren.insert (allChildren.end(),children.begin(),children.end()); } nodes.insert(nodes.begin(),allChildren.begin(),allChildren.end()); } return true; } #endif // _COVER_TREE_H knn.covertree/src/utils.h0000644000176200001440000000035313522503623015137 0ustar liggesusers#ifndef _UTILS_H #define _UTILS_H #include using namespace Rcpp; double cor(const NumericVector v1, const NumericVector v2); IntegerVector rank(const NumericVector x); NumericMatrix rank_mat(const NumericMatrix x); #endif knn.covertree/src/utils.cpp0000644000176200001440000000164113522503623015473 0ustar liggesusers#include "utils.h" using namespace Rcpp; double cor(const NumericVector v1, const NumericVector v2) { const int n = v1.size(); if (n != v2.size()) stop("v1 needs to be of same size as v2"); double v1sum, v2sum, v12sum, v1sqr_sum, v2sqr_sum; v1sum = v2sum = v12sum = v1sqr_sum = v2sqr_sum = 0; for (int i=0; i #include #include "cover_tree.h" #include "utils.h" #include "exports.h" using namespace Rcpp; template class IndexedPoint { private: NumericVector _vec; size_t _idx; public: IndexedPoint(NumericVector v, size_t i) : _vec(v), _idx(i) {} const NumericVector& vec() const { return this->_vec; } size_t idx() const { return this->_idx; } bool operator==(const IndexedPoint& p) { return is_true(all(this->_vec == p.vec())); }; double distance(const IndexedPoint& p) const { return Distance::distance(*this, p); }; }; template std::ostream &operator<<(std::ostream& os, IndexedPoint const& m) { return os << "IndexedPoint(" << m.idx() << ", " << m.vec() << ")"; } class EuclideanDistance { public: static double distance(const IndexedPoint& p1, const IndexedPoint& p2) { const NumericVector& vec1 = p1.vec(); const NumericVector& vec2 = p2.vec(); double dist = 0; const size_t lim = vec1.size(); for (size_t i = 0; i& p1, const IndexedPoint& p2) { return 1 - cor(p1.vec(), p2.vec()); } }; template List knn_cross_impl(const NumericMatrix data, const NumericMatrix query, const size_t k, const size_t skip_self = 0) { if (data.ncol() != query.ncol()) stop("data and query need the same number of features"); const size_t nsmp_data = data.nrow(); const size_t nsmp_query = query.nrow(); std::vector> points; points.reserve(nsmp_data); for (size_t s=0; s(data(s, _), s)); } double max = 1e10;//std::numeric_limits::infinity(); CoverTree> tree(max, points); IntegerMatrix index(nsmp_query, k); NumericMatrix dists(nsmp_query, k); std::fill(index.begin(), index.end(), NA_INTEGER); std::fill(dists.begin(), dists.end(), NA_REAL); typedef Eigen::Triplet T; std::vector triplets; triplets.reserve(2 * nsmp_query * k); typedef std::pair> DP; for (size_t s=0; s p(query(s, _), s); const std::vector nns = tree.kNearestNeighborDists(p, k + skip_self); //Rcout << "#nn:" << nns.size() << ", k:" << k << std::endl; typename std::vector::const_iterator nn; for (nn = nns.begin() + skip_self; nnfirst << ", idx:" << nn->second.idx() << ", vec:" << nn->second.vec() << std::endl; const size_t n = nn - nns.begin() - skip_self; const double dist = nn->first; const size_t nn_idx = nn->second.idx(); index(s, n) = nn_idx + 1; // R index dists(s, n) = dist; triplets.push_back(T(s, nn_idx, dist)); } if (s%100 == 0) Rcpp::checkUserInterrupt(); } Eigen::SparseMatrix dist_mat(nsmp_query, nsmp_data); dist_mat.setFromTriplets(triplets.begin(), triplets.end()); List ret; ret["index"] = index; ret["dist"] = dists; ret["dist_mat"] = dist_mat; return ret; } template List knn_impl(const NumericMatrix data, const size_t k) { return knn_cross_impl(data, data, k, 1); } // [[Rcpp::export]] List knn_cross(const NumericMatrix data, const NumericMatrix query, const size_t k, const std::string distance = "euclidean") { if (distance == "euclidean") { return knn_cross_impl(data, query, k); } else if (distance == "cosine") { return knn_cross_impl(data, query, k); } else if (distance == "rankcor") { NumericMatrix data_rank = rank_mat(data); NumericMatrix query_rank = rank_mat(query); return knn_cross_impl(data_rank, query_rank, k); } else stop("Unknown distance specified"); } // [[Rcpp::export]] List knn_asym(const NumericMatrix data, const size_t k, const std::string distance = "euclidean") { if (distance == "euclidean") { return knn_impl(data, k); } else if (distance == "cosine") { return knn_impl(data, k); } else if (distance == "rankcor") { NumericMatrix data_rank = rank_mat(data); return knn_impl(data_rank, k); } else stop("Unknown distance specified"); } knn.covertree/src/RcppExports.cpp0000644000176200001440000000443113522503623016624 0ustar liggesusers// Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include #include using namespace Rcpp; // knn_cross List knn_cross(const NumericMatrix data, const NumericMatrix query, const size_t k, const std::string distance); RcppExport SEXP _knn_covertree_knn_cross(SEXP dataSEXP, SEXP querySEXP, SEXP kSEXP, SEXP distanceSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< const NumericMatrix >::type query(querySEXP); Rcpp::traits::input_parameter< const size_t >::type k(kSEXP); Rcpp::traits::input_parameter< const std::string >::type distance(distanceSEXP); rcpp_result_gen = Rcpp::wrap(knn_cross(data, query, k, distance)); return rcpp_result_gen; END_RCPP } // knn_asym List knn_asym(const NumericMatrix data, const size_t k, const std::string distance); RcppExport SEXP _knn_covertree_knn_asym(SEXP dataSEXP, SEXP kSEXP, SEXP distanceSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const NumericMatrix >::type data(dataSEXP); Rcpp::traits::input_parameter< const size_t >::type k(kSEXP); Rcpp::traits::input_parameter< const std::string >::type distance(distanceSEXP); rcpp_result_gen = Rcpp::wrap(knn_asym(data, k, distance)); return rcpp_result_gen; END_RCPP } // rank_mat NumericMatrix rank_mat(const NumericMatrix x); RcppExport SEXP _knn_covertree_rank_mat(SEXP xSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< const NumericMatrix >::type x(xSEXP); rcpp_result_gen = Rcpp::wrap(rank_mat(x)); return rcpp_result_gen; END_RCPP } static const R_CallMethodDef CallEntries[] = { {"_knn_covertree_knn_cross", (DL_FUNC) &_knn_covertree_knn_cross, 4}, {"_knn_covertree_knn_asym", (DL_FUNC) &_knn_covertree_knn_asym, 3}, {"_knn_covertree_rank_mat", (DL_FUNC) &_knn_covertree_rank_mat, 1}, {NULL, NULL, 0} }; RcppExport void R_init_knn_covertree(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } knn.covertree/R/0000755000176200001440000000000013540637506013247 5ustar liggesusersknn.covertree/R/knn.covertree-package.r0000644000176200001440000000051013522503623017572 0ustar liggesusers#' A not-too-fast but accurate kNN implementation supporting multiple distance measures #' #' @docType package #' @rdname knn.covertree-package #' @name knn.covertree #' #' @useDynLib knn.covertree ## Make sure Rcpp and RcppEigen are loaded #' @importFrom Rcpp evalCpp #' @importFrom RcppEigen RcppEigen.package.skeleton NULL knn.covertree/R/knn.r0000644000176200001440000000674513540407573014233 0ustar liggesusers#' kNN search #' #' k nearest neighbor search with custom distance function. #' #' @param data Data matrix #' @param query Query matrix. In \code{knn} and \code{knn_asym}, query and data are identical #' @param k Number of nearest neighbors #' @param ... Unused. All parameters to the right of the \code{...} have to be specified by name (e.g. \code{find_knn(data, k, distance = 'cosine')}) #' @param distance Distance metric to use. Allowed measures: Euclidean distance (default), cosine distance (\eqn{1-corr(c_1, c_2)}) or rank correlation distance (\eqn{1-corr(rank(c_1), rank(c_2))}) #' @param sym Return a symmetric matrix (as long as query is NULL)? #' #' @return A \code{\link{list}} with the entries: #' \describe{ #' \item{\code{index}}{A \eqn{nrow(data) \times k} \link{integer} \link{matrix} containing the indices of the k nearest neighbors for each cell.} #' \item{\code{dist}}{A \eqn{nrow(data) \times k} \link{double} \link{matrix} containing the distances to the k nearest neighbors for each cell.} #' \item{\code{dist_mat}}{ #' A \code{\link[Matrix:dgCMatrix-class]{dgCMatrix}} if \code{sym == TRUE}, #' else a \code{\link[Matrix:dsCMatrix-class]{dsCMatrix}} (\eqn{nrow(query) \times nrow(data)}). #' Any zero in the matrix (except for the diagonal) indicates that the cells in the corresponding pair are close neighbors. #' } #' } #' #' @examples #' # The default: symmetricised pairwise distances between all rows #' pairwise <- find_knn(mtcars, 5L) #' image(as.matrix(pairwise$dist_mat)) #' #' # Nearest neighbors of a subset within all #' mercedeses <- grepl('Merc', rownames(mtcars)) #' merc_vs_all <- find_knn(mtcars, 5L, query = mtcars[mercedeses, ]) #' # Replace row index matrix with row name matrix #' matrix( #' rownames(mtcars)[merc_vs_all$index], #' nrow(merc_vs_all$index), #' dimnames = list(rownames(merc_vs_all$index), NULL) #' )[, -1] # 1st nearest neighbor is always the same row #' #' @rdname knn #' @export find_knn <- function(data, k, ..., query = NULL, distance = c('euclidean', 'cosine', 'rankcor'), sym = TRUE) { chkDots(...) if (is.null(dim(data))) stop('data needs to be a ') data <- to_matrix(data) distance <- match.arg(distance) if (is.null(query)) { knn <- knn_asym(data, k, distance) if (sym) knn$dist_mat <- symmetricise(knn$dist_mat) nms <- rownames(data) } else { knn <- knn_cross(data, to_matrix(query), k, distance) nms <- rownames(query) } rownames(knn$dist_mat) <- rownames(knn$index) <- rownames(knn$dist) <- nms colnames(knn$dist_mat) <- rownames(data) knn } #' @importFrom methods is to_matrix <- function(x) { name <- deparse(substitute(x)) if (is.double(x)) return(x) if (is.data.frame(x)) { if (!all(sapply(x, is.numeric))) stop('find_knn parameter "', name, '" only works on numeric data.frames.') } else if (!is.integer(x)) { type <- if (is(x, 'sparseMatrix')) 'sparse matrices' else class(x)[[1L]] warning('find_knn parameter "', name, '" does not specifically support ', type, ', converting data to a dense matrix.') } as.matrix(x) } # (double generic columnsparse to ... symmetric ...: dgCMatrix -> dsCMatrix) # retain all differences fully. symmpart halves them in the case of trans_p[i,j] == 0 && trans_p[j,i] > 0 # TODO: could be more efficient #' @importFrom methods as #' @importFrom Matrix symmpart skewpart forceSymmetric symmetricise <- function(dist_asym) { dist_sym <- symmpart(dist_asym) + abs(forceSymmetric(skewpart(dist_asym), 'U')) as(dist_sym, 'symmetricMatrix') } knn.covertree/R/RcppExports.R0000644000176200001440000000101313540637506015656 0ustar liggesusers# Generated by using Rcpp::compileAttributes() -> do not edit by hand # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 knn_cross <- function(data, query, k, distance = "euclidean") { .Call('_knn_covertree_knn_cross', PACKAGE = 'knn.covertree', data, query, k, distance) } knn_asym <- function(data, k, distance = "euclidean") { .Call('_knn_covertree_knn_asym', PACKAGE = 'knn.covertree', data, k, distance) } rank_mat <- function(x) { .Call('_knn_covertree_rank_mat', PACKAGE = 'knn.covertree', x) } knn.covertree/MD50000644000176200001440000000143613555610202013350 0ustar liggesusers157d1419552b7527dc7f5f0e9ee6d5b3 *DESCRIPTION 1817f54bc5554ec9cd61537718953e07 *NAMESPACE 4a9445f52d1e6dff4e39fb9297904907 *R/RcppExports.R e53411f9f096a70c76fac623433a1f66 *R/knn.covertree-package.r e9012345890c1dcfdf1f2ba89dbfa86e *R/knn.r b511ad36b3e4b47445f74356794733ad *README.md 8d92b03127a2da97be6290ae46612a45 *man/knn.Rd 5bfe505b2b3085150e550cdebbcfdafa *man/knn.covertree-package.Rd a7eec3243d1807eb7d460df2a5a72a1a *src/RcppExports.cpp 00f487b535a77c75975b196edfe98bca *src/cover_tree.h 470cc9553f96844767cb1639c6ed5f86 *src/exports.h 2939f726d9bf09902c5d98f73d028e9b *src/knn.cpp 67fa96308c5b47620210af77631cd9f0 *src/utils.cpp e990a0f4c7c6e6f84e53bfea0c89b0e6 *src/utils.h 37f56ff95cb7fd99ee7190732a52e6b5 *tests/testthat.R ba73351107ff17fa2febd842854db730 *tests/testthat/test-knn.r