gower/0000755000176200001440000000000013053661645011406 5ustar liggesusersgower/inst/0000755000176200001440000000000013053641135012353 5ustar liggesusersgower/inst/doc/0000755000176200001440000000000013053641135013120 5ustar liggesusersgower/inst/doc/intro.html0000644000176200001440000003562013053641135015147 0ustar liggesusers Introduction to the gower package

Introduction to the gower package

Mark van der Loo

2017-02-23

Gower’s distance measure

Gower’s distance can be used to measure how different two records are. The records may contain combinations of logical, numerical, categorical or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). An easy to read specification of the measure is given in the original paper.

Gower (1971) A general coefficient of similarity and some of its properties. Biometrics 27 857-874.

In short, Gower’s distance (or similarity) first computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair.

This package modifies Gower’s original similarity measure in the following ways.

Computing Gower’s distance

The function gower_dist computes pairwise-distances between records.

library(gower)
dat1 <- iris[1:10,]
dat2 <- iris[6:15,]
gower_dist(dat1, dat2)
##  [1] 0.34606061 0.17939394 0.14303030 0.09636364 0.20424242 0.23636364
##  [7] 0.16000000 0.19939394 0.19818182 0.45030303

If one data frame has less records than the other, the shortest one is recycled over (just like when you’re adding two vectors of unequal length)

gower_dist(iris[1,], dat1)
##  [1] 0.0000000 0.1400000 0.1900000 0.2300000 0.0400000 0.4233333 0.1866667
##  [8] 0.0900000 0.2600000 0.2366667

It is possible to control how columns from the two data sets are paired for comparison using the pair_x and pair_y arguments. This comes in handy when similar colums have different names accross datasets. By default, columns with matching names are paired. The behaviour is somewhat similar to that of base R’s merge in that respect.

dat1 <- dat2 <- iris[1:10,]
names(dat2) <- tolower(names(dat2))
gower_dist(dat1, dat2)
##  [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2
gower_dist(dat1, dat2, pair_y=1:5)
##  [1] 0 0 0 0 0 0 0 0 0 0

Computing the top-n matches

The function gower_topn returns a list with two arrays.

dat1 <- iris[1:10,]
L <- gower_topn(x=dat1, y=iris, n=3)
L
## $index
##       row
## topn   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##   [1,]    1    2    3    4    5    6    7    8    9    10
##   [2,]   18   35   48   48   38   17   48   40   39    35
##   [3,]   28   26   30   31    1   45   12   50   43    31
## 
## $distance
##       row
## topn          [,1]       [,2]        [,3]       [,4]       [,5]       [,6]
##   [1,] 0.000000000 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
##   [2,] 0.008333333 0.01172316 0.008945386 0.01172316 0.01388889 0.01355932
##   [3,] 0.008945386 0.01233522 0.010169492 0.01450094 0.01388889 0.03177966
##       row
## topn         [,7]        [,8]       [,9]       [,10]
##   [1,] 0.00000000 0.000000000 0.00000000 0.000000000
##   [2,] 0.02500000 0.005555556 0.01172316 0.008333333
##   [3,] 0.02622411 0.011723164 0.02838983 0.017278719

The first array is called index. Each column corresponds to one row of x. The entries of each column index the top n best matches of that row in x with rows in y. In this example, the best match of the first row of dat1 is record number 1 from iris (this should be obvious, since they are the same record). The second best match is record number 18 from iris.

The second array is called distance and it contains the corresponding distances.

Parallelization, memory usage

The underlying algorithm is implemented in C and parallelized using OpenMP. OpenMP is available on most systems that can run R. Please see this section of the writing R extensions manual for up-to-details details on which systems are supported. At the time of writing (summer 2016), OSX is the only system not supporting OpenMP out of the box. You can still make it work by installing the gcc toolchain and compiling the package (and R).

If OpenMP is not supported, the package will still work but the core algorithms will not be parallelized.

This implementation makes no copies of the data in memory. When computing gower_dist, two double precision arrays of size max(nrow(x),nrow(y)) are kept in memory to store intermediate results. When computing the top-n matches, for k cores, k+2 double precision arrays of length nrow(y) are created to store intermediate results at C level.

gower/inst/doc/intro.R0000644000176200001440000000123313053641135014375 0ustar liggesusers## ------------------------------------------------------------------------ library(gower) dat1 <- iris[1:10,] dat2 <- iris[6:15,] gower_dist(dat1, dat2) ## ------------------------------------------------------------------------ gower_dist(iris[1,], dat1) ## ------------------------------------------------------------------------ dat1 <- dat2 <- iris[1:10,] names(dat2) <- tolower(names(dat2)) gower_dist(dat1, dat2) # tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2 gower_dist(dat1, dat2, pair_y=1:5) ## ------------------------------------------------------------------------ dat1 <- iris[1:10,] L <- gower_topn(x=dat1, y=iris, n=3) L gower/inst/doc/intro.Rmd0000644000176200001440000000760613004411462014722 0ustar liggesusers--- title: "Introduction to the gower package" author: "Mark van der Loo" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to the gower package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Gower's distance measure Gower's distance can be used to measure how different two records are. The records may contain combinations of logical, numerical, categorical or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). An easy to read specification of the measure is given in the original paper. Gower (1971) [A general coefficient of similarity and some of its properties.](http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf) _Biometrics_ **27** 857-874. In short, Gower's distance (or similarity) first computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair. This package modifies Gower's original similarity measure in the following ways. - In stead of the original similarity _S_, the package returns the distance _1-S_. - The original paper does not mention the concept of `NA`. Missing variables are skipped when computing the distance. - The original paper does not mention character data. These are treated as categorical data. ## Computing Gower's distance The function ```gower_dist``` computes pairwise-distances between records. ```{r} library(gower) dat1 <- iris[1:10,] dat2 <- iris[6:15,] gower_dist(dat1, dat2) ``` If one data frame has less records than the other, the shortest one is recycled over (just like when you're adding two vectors of unequal length) ```{r} gower_dist(iris[1,], dat1) ``` It is possible to control how columns from the two data sets are paired for comparison using the `pair_x` and `pair_y` arguments. This comes in handy when similar colums have different names accross datasets. By default, columns with matching names are paired. The behaviour is somewhat similar to that of base R's `merge` in that respect. ```{r} dat1 <- dat2 <- iris[1:10,] names(dat2) <- tolower(names(dat2)) gower_dist(dat1, dat2) # tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2 gower_dist(dat1, dat2, pair_y=1:5) ``` ## Computing the top-n matches The function `gower_topn` returns a list with two arrays. ```{r} dat1 <- iris[1:10,] L <- gower_topn(x=dat1, y=iris, n=3) L ``` The first array is called `index`. Each column corresponds to one row of `x`. The entries of each column index the top _n_ best matches of that row in x with rows in `y`. In this example, the best match of the first row of `dat1` is record number ```r L$index[1,1]``` from `iris` (this should be obvious, since they are the same record). The second best match is record number ```r L$index[2,1]``` from `iris`. The second array is called `distance` and it contains the corresponding distances. ## Parallelization, memory usage The underlying algorithm is implemented in C and parallelized using [OpenMP](http://www.openmp.org). OpenMP is available on most systems that can run R. Please see [this section](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support) of the writing R extensions manual for up-to-details details on which systems are supported. At the time of writing (summer 2016), OSX is the only system not supporting OpenMP out of the box. You can still make it work by installing the gcc toolchain and compiling the package (and R). If OpenMP is not supported, the package will still work but the core algorithms will not be parallelized. This implementation makes no copies of the data in memory. When computing `gower_dist`, two double precision arrays of size _max(nrow(x),nrow(y))_ are kept in memory to store intermediate results. When computing the top-n matches, for _k_ cores, _k+2_ double precision arrays of length ```nrow(y)``` are created to store intermediate results at C level. gower/tests/0000755000176200001440000000000013004411462012532 5ustar liggesusersgower/tests/testthat.R0000644000176200001440000000005713004411462014517 0ustar liggesusersif ( require(testthat) ) test_check("gower") gower/tests/testthat/0000755000176200001440000000000013053661645014410 5ustar liggesusersgower/tests/testthat/test_gower.R0000644000176200001440000000635013004411462016703 0ustar liggesusers context("Basic distance elements") test_that("distance between logicals",{ dL <- expand.grid(c(TRUE,FALSE),c(TRUE,FALSE)) x = data.frame(x=dL[,1]) y = data.frame(x=dL[,2]) expect_equal(gower_dist(x = x,y = y),c(0,1,1,NaN)) }) test_that("distance between factor variables",{ bands <- c("Grand Magus","Skull Fist") dF <- expand.grid(bands,bands) expect_equal(gower_dist(data.frame(x=dF[,1]),data.frame(x=dF[,2])),c(0,1,1,0)) }) test_that("distance between numerical variables",{ dN <- data.frame(x = as.numeric(1:4),y=as.numeric(c(1,1,2,3))) expect_equal(gower_dist(data.frame(x=dN[,1]),data.frame(x=dN[,2])),c(0,1/3,1/3,1/3)) }) test_that("distance between character variables",{ dC <- data.frame(x=letters[1:3],y=letters[3:1],stringsAsFactors=FALSE) expect_equal(gower_dist( data.frame(x=dC[,1],stringsAsFactors=FALSE) , data.frame(x=dC[,2],stringsAsFactors=FALSE)),c(1,0,1)) }) test_that("multivariate dataset",{ bands <- c("Grand Magus","Skull Fist") dL <- expand.grid(c(TRUE,FALSE),c(TRUE,FALSE)) dN <- data.frame(x = as.numeric(1:4),y=as.numeric(c(1,1,2,3))) dF <- expand.grid(bands,bands) dM1 <- data.frame(x=dL[,1],y=dF[,1],z=dN[,1]) dM2 <- data.frame(x=dL[,2],y=dF[,2],z=dN[,2]) expect_equal(gower_dist(x=dM1,y=dM2), c(0,7/9,7/9,1/6)) # check symmetry expect_equal(gower_dist(dM1,dM2),gower_dist(dM2,dM1)) # not counting NA's in the denominator dM1[array(c(2,3,4,1,2,3),dim=c(3,2))] <- NA expect_equal(gower_dist(dM1,dM2), c(0,3/4,3/4,0)) }) test_that("recycling",{ expect_equal(length(gower_dist(x=iris[1,],y=iris)), nrow(iris)) expect_equal(length(gower_dist(x=iris,y=iris[1,])), nrow(iris)) expect_equal(length(gower_dist(x=iris[1:3,],y=iris)), nrow(iris)) expect_equal(length(gower_dist(x=iris,y=iris[1:3,])), nrow(iris)) }) test_that("exceptions",{ expect_warning(gower_dist( x = data.frame(x=c(1.2,1.2,1.2)) , y = data.frame(x=c(1.2,1.2,1.2)) )) expect_warning(gower_dist( x = data.frame(x=c(1.2,1.2,1.2)) , y = data.frame(x=c(1.2,1.2,1.3)) , eps=0.2 )) expect_warning(gower_dist(data.frame(x=rep(1,100)), data.frame(x=1,100))) }) context("Top-n") test_that("gower_topn",{ d1 <- iris[1:3,] d2 <- iris[1:7,] L <- gower_topn(d1,d2,n=4) expect_equal(length(L),2) expect_equal(dim(L[[1]]),c(4,3)) expect_equal(dim(L[[1]]),dim(L[[2]])) expect_equal(L[[1]][1,],1:3) expect_equal(L[[2]][1,],rep(0,3)) # case where n exceeds nr of records in the lookup table. L <- gower_topn(d1,d2,n=8) expect_equal(L[[1]][8,],rep(0,3)) expect_equal(L[[2]][8,],rep(Inf,3)) }) test_that("just to get code-coverage right",{ dat1 <- data.frame( x = as.factor(sample(letters,2000,replace=TRUE)) ,y = sample(LETTERS,2000,replace=TRUE) ,z = as.integer(1:2000) ,w = sample(c(TRUE,FALSE),2000,replace=TRUE) , stringsAsFactors=FALSE ) i <- sample(2000) dat2 <- dat1[i,] gower_dist(dat1,dat2) }) test_that("warning on recycling",{ expect_warning(gower_dist(iris[1:3,],iris[1:2,])) expect_warning(gower_dist(iris[1:2,],iris[1:3,])) }) context("regression tests") test_that("topn w/n=1",{ dat <- data.frame(x = c(NA,2,4,5), y = c(6,7,NA,10)) L <- gower_topn(dat[c(1,3),],dat[c(2,4),],n=1) expect_equivalent(as.vector(L$index),c(1,2)) }) gower/src/0000755000176200001440000000000013053641135012165 5ustar liggesusersgower/src/Makevars0000644000176200001440000000011013053641135013651 0ustar liggesusers PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS) PKG_LIBS = $(SHLIB_OPENMP_CFLAGS) gower/src/R_register_native.c0000644000176200001440000000170113053641135016003 0ustar liggesusers#include #include #include // for NULL #include /* FIXME: Check these declarations against the C/Fortran source code. */ /* .Call calls */ extern SEXP R_get_max_threads(); extern SEXP R_get_thread_limit(); extern SEXP R_get_xy_range(SEXP, SEXP, SEXP); extern SEXP R_gower(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP); extern SEXP R_gower_topn(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP); static const R_CallMethodDef CallEntries[] = { {"R_get_max_threads", (DL_FUNC) &R_get_max_threads, 0}, {"R_get_thread_limit", (DL_FUNC) &R_get_thread_limit, 0}, {"R_get_xy_range", (DL_FUNC) &R_get_xy_range, 3}, {"R_gower", (DL_FUNC) &R_gower, 7}, {"R_gower_topn", (DL_FUNC) &R_gower_topn, 8}, {NULL, NULL, 0} }; void R_init_gower(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } gower/src/gower.c0000644000176200001440000004426013053641135013462 0ustar liggesusers/* gower - a C/R implementation of Gower's similarity (or distance) measure. * Copyright (C) 2016 Mark van der Loo * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see . * * You can contact the author at: mark _dot_ vanderloo _at_ gmail _dot_ com * */ #ifdef _OPENMP #include #endif /* R-windows-oldrel (3.2.x) uses gcc 4.6.3 which we need to detect */ #ifdef __GNUC__ #if __GNUC__ <= 4 && __GNUC_MINOR__ <= 6 #else #define HAS_REDUCTION #endif #endif #include #include #include #define MAX(X,Y) ((X) > (Y) ? (X) : (Y)) #define MIN(X,Y) ((X) < (Y) ? (X) : (Y)) #define RECYCLE(N, K) ((N) + 1L < (K) ? (N) + 1L : 0L ) // determine when something is numerically zero. static double EPS = 1e-8; // set max nr of threads to use static int NTHREAD = 1L; // recycling in parallel region. static inline int recycle(int i, int nthreads, int ni){ i += nthreads; if ( i >= ni ) i = (nthreads < ni) ? (i - ni) : (i % ni); return i; } SEXP R_get_max_threads(){ SEXP out = allocVector(INTSXP, 1L); PROTECT(out); INTEGER(out)[0] = 1L; #ifdef _OPENMP INTEGER(out)[0] = omp_get_max_threads(); #endif UNPROTECT(1); return out; } SEXP R_get_num_procs(){ SEXP out = allocVector(INTSXP, 1L); PROTECT(out); INTEGER(out)[0] = 1L; #ifdef _OPENMP INTEGER(out)[0] = omp_get_num_procs(); #endif UNPROTECT(1); return out; } SEXP R_get_thread_limit(){ SEXP out = allocVector(INTSXP, 1L); PROTECT(out); INTEGER(out)[0] = 1L; #ifdef _OPENMP INTEGER(out)[0] = omp_get_thread_limit(); #endif UNPROTECT(1); return out; } // presence or absence of a character. x and y are 0 (FALSE) or 1 (TRUE) static inline void gower_logi(int *x, int nx, int *y, int ny , double *num, double *den) { #pragma omp parallel num_threads(NTHREAD) { int nt = MAX(nx,ny); double dijk, sijk; int i = 0, j = 0; double *inum = num, *iden=den; int ID = 0, num_threads=1; #ifdef _OPENMP ID = omp_get_thread_num(); num_threads = omp_get_num_threads(); i = recycle(ID-num_threads, num_threads, nx); j = recycle(ID-num_threads, num_threads, ny); inum += ID; iden += ID; #endif for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){ dijk = (double) ((x[i] | y[j]) & !((x[i] == NA_INTEGER) | (y[j] == NA_INTEGER))); sijk = (dijk == 1.0) ? (double) (x[i] * y[j]) : 0.0; *inum += dijk * sijk; *iden += dijk; i = recycle(i, num_threads, nx); j = recycle(j, num_threads, ny); } } // end parallel region. } // equality of categorical variables, encoded as x, y in {1,2,...,N}. static inline void gower_cat(int *x, int nx, int *y, int ny , double *num, double *den) { #pragma omp parallel num_threads(NTHREAD) { int nt = MAX(nx,ny); double dijk, sijk; int i = 0, j = 0; double *inum = num, *iden=den; int ID = 0, num_threads=1; #ifdef _OPENMP ID = omp_get_thread_num(); num_threads = omp_get_num_threads(); i = recycle(ID-num_threads, num_threads, nx); j = recycle(ID-num_threads, num_threads, ny); inum += ID; iden += ID; #endif for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){ dijk = (double) !(x[i] == NA_INTEGER || y[j] == NA_INTEGER); sijk = (dijk==1.0) ? (double) (x[i] == y[j]) : 0.0; *inum += dijk * sijk; *iden += dijk; i = recycle(i, num_threads, nx); j = recycle(j, num_threads, ny); } } // end parallel region. } // strings. Treated as categories. static inline void gower_str(SEXP x, int nx, SEXP y, int ny, double *num, double *den){ #pragma omp parallel num_threads(NTHREAD) { int nt = MAX(nx, ny); double dijk, sijk; int i=0, j=0; double *inum = num, *iden=den; SEXP xi, yj; int ID = 0, num_threads=1; #ifdef _OPENMP ID = omp_get_thread_num(); num_threads = omp_get_num_threads(); i = recycle(ID-num_threads, num_threads, nx); j = recycle(ID-num_threads, num_threads, ny); inum += ID; iden += ID; #endif for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){ xi = STRING_ELT(x,i); yj = STRING_ELT(y,j); dijk = (double) !(xi == NA_STRING || yj == NA_STRING); sijk = (dijk==1.0) ? (double) (CHAR(xi) == CHAR(yj)) : 0.0; *inum += dijk * sijk; *iden += dijk; i = recycle(i, num_threads, nx); j = recycle(j, num_threads, ny); } } // end of parallel region } // comparison of numerical variables, by absolute difference divided by range. static inline void gower_num(double *x, int nx, double *y, int ny,double R , double *num, double *den) { if ( !isfinite(R) || R < EPS ){ warning("skipping variable with zero or non-finite range."); return; } #pragma omp parallel num_threads(NTHREAD) { int nt = MAX(nx,ny); double dijk, sijk; int i = 0, j = 0; double *inum = num, *iden=den; int ID = 0, num_threads=1; #ifdef _OPENMP ID = omp_get_thread_num(); num_threads = omp_get_num_threads(); i = recycle(ID-num_threads, num_threads, nx); j = recycle(ID-num_threads, num_threads, ny); inum += ID; iden += ID; #endif for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){ dijk = (double) (isfinite(x[i]) & isfinite(y[j])); sijk = (dijk==1.0) ? (1.0-fabs(x[i]-y[j])/R) : 0.0; (*inum) += dijk * sijk; (*iden) += dijk; i = recycle(i, num_threads, nx); j = recycle(j, num_threads, ny); } } // end of parallel region } static inline void gower_dbl_int(double *x, int nx, int *y, int ny,double R , double *num, double *den) { if ( !isfinite(R) || R < EPS ){ warning("skipping variable with zero or non-finite range\n"); return; } #pragma omp parallel num_threads(NTHREAD) { int nt = MAX(nx, ny); double dijk, sijk; int i=0, j=0; double *inum = num, *iden=den; int ID = 0, num_threads=1; #ifdef _OPENMP ID = omp_get_thread_num(); num_threads = omp_get_num_threads(); i = recycle(ID-num_threads, num_threads, nx); j = recycle(ID-num_threads, num_threads, ny); inum += ID; iden += ID; #endif for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){ dijk = (double) (isfinite(x[i]) & (y[j] != NA_INTEGER)); sijk = (dijk==1.0) ? (1.0-fabs(x[i] - ((double) y[j]) )/R) : 0.0; *inum += dijk * sijk; *iden += dijk; i = recycle(i, num_threads, nx); j = recycle(j, num_threads, ny); } } // end of parallel region } static inline void gower_int(int *x, int nx, int *y, int ny, double R , double *num, double *den) { if ( !isfinite(R) || R == 0 ){ warning("skipping variable with zero or non-finite range\n"); return; } #pragma omp parallel num_threads(NTHREAD) { int nt = MAX(nx, ny); double dijk, sijk; int i=0, j=0; double *inum = num, *iden=den; int ID = 0, num_threads=1; #ifdef _OPENMP ID = omp_get_thread_num(); num_threads = omp_get_num_threads(); i = recycle(ID-num_threads, num_threads, nx); j = recycle(ID-num_threads, num_threads, ny); inum += ID; iden += ID; #endif for ( int k = ID; k < nt; k += num_threads, inum += num_threads, iden += num_threads){ dijk = (double) ( (x[i] !=NA_INTEGER) & (y[j] != NA_INTEGER)); sijk = (dijk==1.0) ? (1.0-fabs( ((double)x[i]) - ((double)y[j]) )/R) : 0.0; *inum += dijk * sijk; *iden += dijk; i = recycle(i, num_threads, nx); j = recycle(j, num_threads, ny); } } // end of parallel region } // range computations static void get_dbl_range(double *x, int nx, double *min, double *max){ double *ix = x; double imin=*ix, imax=*ix; for ( int i=0; i imax){ imax = x[i]; } else if ( x[i] < imin ){ imin = x[i]; } } } }// end parallel region *min = imin; *max = imax; } static void get_int_range(int *x, int nx, double *min, double *max){ int *ix = x; int imin = x[0] , imax = x[0]; for ( int i=0; i imax){ imax = x[i]; } else if ( x[i] < imin ){ imin = x[i]; } } } *min = (double) imin; *max = (double) imax; } static void get_range(SEXP x, double *min, double *max){ switch( TYPEOF(x) ){ case INTSXP : { get_int_range(INTEGER(x), length(x), min, max); break; } case REALSXP : { get_dbl_range(REAL(x), length(x), min, max); break; } } } static double get_xy_range(SEXP x, SEXP y){ double x_min = R_NegInf , x_max = R_PosInf , y_min = R_NegInf , y_max = R_PosInf , min, max; get_range(x, &x_min, &x_max); get_range(y, &y_min, &y_max); if ( isfinite(x_min) & isfinite(y_min) ){ min = MIN(x_min, y_min); } else if ( isfinite(x_min) & !(isfinite(y_min)) ){ min = x_min; } else if ( (!isfinite(x_min)) & isfinite(y_min) ) { min = y_min; } else { min = NA_REAL; } if ( isfinite(x_max) & isfinite(y_min) ){ max = MAX(x_max, y_max); } else if ( isfinite(x_max) & !isfinite(y_max) ){ max = x_max; } else if ( (!isfinite(x_max)) & isfinite(y_max) ){ max = y_max; } else { max = NA_REAL; } return max - min; } SEXP R_get_xy_range(SEXP x_, SEXP y_, SEXP nthread_){ NTHREAD = INTEGER(nthread_)[0]; SEXP out = allocVector(REALSXP,1L); PROTECT(out); REAL(out)[0] = get_xy_range(x_, y_); UNPROTECT(1); return out; } /* static void print_vec(double *x, int n){ for ( int i=0; ii; j-- ){ topn[j] = topn[j-1]; index[j] = index[j-1]; } topn[i] = x; index[i] = ind; break; // out of main loop } } } } /* For testing purposes SEXP R_pushdown(SEXP entry_, SEXP index_, SEXP values_, SEXP indices_){ double entry = REAL(entry_)[0]; int index = INTEGER(index_)[0]; double *values = REAL(values_); int *indices = INTEGER(indices_); int n = length(values_); push(entry, index, values, indices, n); return R_NilValue; } */ static inline void copyrec(SEXP into, SEXP from, int i){ int ncol = length(into); SEXP col_from, col_into; for ( int j = 0; j < ncol; j++){ col_from = VECTOR_ELT(from,j); col_into = VECTOR_ELT(into,j); switch(TYPEOF(col_from)){ case LGLSXP : { INTEGER(col_into)[0] = INTEGER(col_from)[i]; break;} case INTSXP : { INTEGER(col_into)[0] = INTEGER(col_from)[i]; break;} case REALSXP : { REAL(col_into)[0] = REAL(col_from)[i]; break;} case STRSXP : { SET_STRING_ELT(col_from, 0, STRING_ELT(col_from,i)); break;} } } } /* for testing purposes only void prvec(SEXP x){ for (int i=0; i 2) max_threads <- max_threads - 1L options(gd_num_thread = as.integer(max_threads)) } gower/R/gower.R0000644000176200001440000001107613053636412013054 0ustar liggesusers #' Gower's distance #' #' Compute Gower's distance, pairwise between records in two data sets \code{x} #' and \code{y}. Records from the smallest data set are recycled over. #' #' @section Details: #' There are three ways to specify which columns of \code{x} should be compared #' with what columns of \code{y}. The first option is do give no specification. #' In that case columns with matching names will be used. The second option #' is to use only the \code{pairs_y} argument, specifying for each column in \code{x} #' in order, which column in \code{y} must be used to pair it with (use \code{0} #' to skip a column in \code{x}). The third option is to explicitly specify the #' columns to be matched using \code{pair_x} and \code{pair_y}. #' #' #' @section Note: #' Gower (1971) originally defined a similarity measure (\eqn{s}, say) #' with values ranging from 0 (completely dissimilar) to 1 (completely similar). #' The distance returned here equals \eqn{1-s}. #' #' #' @param x \code{[data.frame]} #' @param y \code{[data.frame]} #' @param pair_x \code{[numeric|character] (optional)} Columns in \code{x} used for comparison. #' See Details below. #' @param pair_y \code{[numeric|character] (optional)} Columns in \code{y} used for comparison. #' See Details below. #' @param eps \code{[numeric] (optional)} Computed numbers (variable ranges) #' smaller than \code{eps} are treated as zero. #' @param nthread Number of threads to use for parallelization. By default, #' for a dual-core machine, 2 threads are used. For any other machine #' n-1 cores are used so your machine doesn't freeze during a big computation. #' The maximum nr of threads are determined from \code{omp::get_max_threads}. #' #' #' @return #' A \code{numeric} vector of length \code{max(nrow(x),nrow(y))}. #' #' @seealso \code{\link{gower_topn}} #' #' @references #' Gower, John C. "A general coefficient of similarity and some of its #' properties." Biometrics (1971): 857-871. #' #' @export gower_dist <- function(x, y, pair_x=NULL, pair_y=NULL, eps = 1e-8 ,nthread=getOption("gd_num_thread")){ check_recycling(nrow(x),nrow(y)) check_recycling(nrow(x),nrow(y)) check_recycling(nrow(x),nrow(y)) check_recycling(nrow(x),nrow(y)) gower_work(x=x,y=y,pair_x=pair_x,pair_y=pair_y,n=NULL,eps=eps,nthread=nthread) } #' Find the top-n matches #' #' @description #' #' Find the top-n matches in \code{y} for each record in \code{x}. #' #' @inheritParams gower_dist #' @param n The top-n indices and distances to return. #' #' @seealso \code{\link{gower_dist}} #' #' @return #' A \code{list} with two array elements: \code{index} #' and \code{distance}. Both have size \code{n X nrow(x)}. Each ith column #' corresponds to the top-n best matches of \code{x} with rows in \code{y}. #' #' @examples #' # find the top 4 best matches in the iris data set with itself. #' x <- iris[1:3,] #' lookup <- iris[1:10,] #' gower_topn(x=x,y=lookup,n=4) #' #' #' @export gower_topn <- function(x, y, pair_x=NULL, pair_y = NULL, n=5, eps=1e-8 , nthread=getOption("gd_num_thread")){ gower_work(x=x,y=y,pair_x=pair_x,pair_y=pair_y,n=n,eps=eps,nthread) } gower_work <- function(x, y, pair_x, pair_y, n, eps, nthread){ stopifnot(is.numeric(eps), eps>0) if (is.null(pair_x) & is.null(pair_y)){ pair <- match(names(x),names(y),nomatch = 0L) } else if (is.null(pair_x)){ pair <- pair_y } else { if (is.character(pair_x) & is.character(pair_y)){ m <- match(names(x),pair_x,nomatch=0) pair_x <- pair_x[m] pair_y <- pair_y[m] } pair <- numeric(ncol(x)) pair[pair_x] <- pair_y } nthread <- as.integer(nthread) ranges <- numeric(length(pair)) for ( i in seq_along(pair)){ if (pair[i] == 0 ) next ranges[i] <- .Call("R_get_xy_range",x[[i]],y[[pair[i]]],nthread) } factor_pair <- as.integer(sapply(x,is.factor)) eps <- as.double(eps) # translate to C-indices (base-0). pair <- as.integer(pair-1L) if (is.null(n)){ .Call("R_gower", x, y , ranges, pair, factor_pair, eps, nthread) } else { L <- .Call("R_gower_topn", x, y, ranges, pair, factor_pair, as.integer(n), eps, nthread) names(L) <- c("index","distance") dim(L$index) <- c(n,nrow(x)) dim(L$distance) <- dim(L$index) dimnames(L$index) <- list(topn=NULL,row=NULL) dimnames(L$distance) <- dimnames(L$index) L } } RECYCLEWARNING <- tryCatch(1:3+1:2,warning=function(e) e$message) check_recycling <- function(nx,ny){ mx <- max(nx,ny) mn <- min(nx,ny) if ((mx %% mn) != 0) warning(RECYCLEWARNING, call.=FALSE) } gower/vignettes/0000755000176200001440000000000013053641135013406 5ustar liggesusersgower/vignettes/intro.Rmd0000644000176200001440000000760613004411462015210 0ustar liggesusers--- title: "Introduction to the gower package" author: "Mark van der Loo" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to the gower package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Gower's distance measure Gower's distance can be used to measure how different two records are. The records may contain combinations of logical, numerical, categorical or text data. The distance is always a number between 0 (identical) and 1 (maximally dissimilar). An easy to read specification of the measure is given in the original paper. Gower (1971) [A general coefficient of similarity and some of its properties.](http://venus.unive.it/romanaz/modstat_ba/gowdis.pdf) _Biometrics_ **27** 857-874. In short, Gower's distance (or similarity) first computes distances between pairs of variables over two data sets and then combines those distances to a single value per record-pair. This package modifies Gower's original similarity measure in the following ways. - In stead of the original similarity _S_, the package returns the distance _1-S_. - The original paper does not mention the concept of `NA`. Missing variables are skipped when computing the distance. - The original paper does not mention character data. These are treated as categorical data. ## Computing Gower's distance The function ```gower_dist``` computes pairwise-distances between records. ```{r} library(gower) dat1 <- iris[1:10,] dat2 <- iris[6:15,] gower_dist(dat1, dat2) ``` If one data frame has less records than the other, the shortest one is recycled over (just like when you're adding two vectors of unequal length) ```{r} gower_dist(iris[1,], dat1) ``` It is possible to control how columns from the two data sets are paired for comparison using the `pair_x` and `pair_y` arguments. This comes in handy when similar colums have different names accross datasets. By default, columns with matching names are paired. The behaviour is somewhat similar to that of base R's `merge` in that respect. ```{r} dat1 <- dat2 <- iris[1:10,] names(dat2) <- tolower(names(dat2)) gower_dist(dat1, dat2) # tell gower_dist to match columns 1..5 in dat1 with column 1..5 in dat2 gower_dist(dat1, dat2, pair_y=1:5) ``` ## Computing the top-n matches The function `gower_topn` returns a list with two arrays. ```{r} dat1 <- iris[1:10,] L <- gower_topn(x=dat1, y=iris, n=3) L ``` The first array is called `index`. Each column corresponds to one row of `x`. The entries of each column index the top _n_ best matches of that row in x with rows in `y`. In this example, the best match of the first row of `dat1` is record number ```r L$index[1,1]``` from `iris` (this should be obvious, since they are the same record). The second best match is record number ```r L$index[2,1]``` from `iris`. The second array is called `distance` and it contains the corresponding distances. ## Parallelization, memory usage The underlying algorithm is implemented in C and parallelized using [OpenMP](http://www.openmp.org). OpenMP is available on most systems that can run R. Please see [this section](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support) of the writing R extensions manual for up-to-details details on which systems are supported. At the time of writing (summer 2016), OSX is the only system not supporting OpenMP out of the box. You can still make it work by installing the gcc toolchain and compiling the package (and R). If OpenMP is not supported, the package will still work but the core algorithms will not be parallelized. This implementation makes no copies of the data in memory. When computing `gower_dist`, two double precision arrays of size _max(nrow(x),nrow(y))_ are kept in memory to store intermediate results. When computing the top-n matches, for _k_ cores, _k+2_ double precision arrays of length ```nrow(y)``` are created to store intermediate results at C level. gower/MD50000644000176200001440000000161513053661645011721 0ustar liggesusersfeb1cf5c6d28497cf375369bfa5c0cab *DESCRIPTION 83404f04f998736f089f40b74968e0a3 *NAMESPACE d2cd60900c18a09051290f7cb994771e *NEWS 2a15d8f245c2f6d3cd49f98d1c760bbd *R/gower-pkg.R 6152fcc7e8ce26080d8e7a10d70cc5bb *R/gower.R 00948546dfa81f53c4bebf838eb2d98a *build/vignette.rds dedc916a69d5cf53c8285208f4684caa *inst/doc/intro.R 060e532599d0221cfb9e3be9fa0329ae *inst/doc/intro.Rmd 7da7d4be933ba19e74db91a4a03db746 *inst/doc/intro.html 68b9193cbb24435b2ecb773b97b23ccb *man/gower-package.Rd 47011f697c87c9a551f9adbfb126f7b3 *man/gower_dist.Rd 8f4ea66daefdb202361ce25902917980 *man/gower_topn.Rd 4e95da77024f1ca60e63ce4aabf063c1 *src/Makevars 8acb5e89b5fccf4ee507995a333b5572 *src/R_register_native.c 2b99d922d1401ac4c1ebe96c2c93aa4d *src/gower.c 3009b35ff0db11976f27dba9963141f3 *tests/testthat.R 7fcab2ed91229b68f9a721450cb328d8 *tests/testthat/test_gower.R 060e532599d0221cfb9e3be9fa0329ae *vignettes/intro.Rmd gower/build/0000755000176200001440000000000013053641135012475 5ustar liggesusersgower/build/vignette.rds0000644000176200001440000000033213053641135015032 0ustar liggesusersmP0 /|!1⃯ LX_.vFMu;=>!& с ,qDžV2J[.YXȎٕ S !q _e Akbn&rϷzDYؽjjcMsO\.1NJ3]41ЃExQ?F|hHҽgower/DESCRIPTION0000644000176200001440000000146613053661645013123 0ustar liggesusersPackage: gower Maintainer: Mark van der Loo License: GPL-3 Title: Gower's Distance LazyData: no Type: Package LazyLoad: yes Authors@R: c( person("Mark", "van der Loo", role=c("aut","cre"),email="mark.vanderloo@gmail.com")) Description: Compute Gower's distance (or similarity) coefficient between records. Compute the top-n matches between records. Core algorithms are executed in parallel on systems supporting OpenMP. Version: 0.1.2 URL: https://github.com/markvanderloo/gower BugReports: https://github.com/markvanderloo/gower/issues Date: 2017-02-23 Suggests: testthat, knitr, rmarkdown RoxygenNote: 6.0.1 VignetteBuilder: knitr NeedsCompilation: yes Packaged: 2017-02-23 20:13:17 UTC; mark Author: Mark van der Loo [aut, cre] Repository: CRAN Date/Publication: 2017-02-23 23:35:17 gower/man/0000755000176200001440000000000013017032342012143 5ustar liggesusersgower/man/gower_dist.Rd0000644000176200001440000000413313053640755014616 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/gower.R \name{gower_dist} \alias{gower_dist} \title{Gower's distance} \usage{ gower_dist(x, y, pair_x = NULL, pair_y = NULL, eps = 1e-08, nthread = getOption("gd_num_thread")) } \arguments{ \item{x}{\code{[data.frame]}} \item{y}{\code{[data.frame]}} \item{pair_x}{\code{[numeric|character] (optional)} Columns in \code{x} used for comparison. See Details below.} \item{pair_y}{\code{[numeric|character] (optional)} Columns in \code{y} used for comparison. See Details below.} \item{eps}{\code{[numeric] (optional)} Computed numbers (variable ranges) smaller than \code{eps} are treated as zero.} \item{nthread}{Number of threads to use for parallelization. By default, for a dual-core machine, 2 threads are used. For any other machine n-1 cores are used so your machine doesn't freeze during a big computation. The maximum nr of threads are determined from \code{omp::get_max_threads}.} } \value{ A \code{numeric} vector of length \code{max(nrow(x),nrow(y))}. } \description{ Compute Gower's distance, pairwise between records in two data sets \code{x} and \code{y}. Records from the smallest data set are recycled over. } \section{Details}{ There are three ways to specify which columns of \code{x} should be compared with what columns of \code{y}. The first option is do give no specification. In that case columns with matching names will be used. The second option is to use only the \code{pairs_y} argument, specifying for each column in \code{x} in order, which column in \code{y} must be used to pair it with (use \code{0} to skip a column in \code{x}). The third option is to explicitly specify the columns to be matched using \code{pair_x} and \code{pair_y}. } \section{Note}{ Gower (1971) originally defined a similarity measure (\eqn{s}, say) with values ranging from 0 (completely dissimilar) to 1 (completely similar). The distance returned here equals \eqn{1-s}. } \references{ Gower, John C. "A general coefficient of similarity and some of its properties." Biometrics (1971): 857-871. } \seealso{ \code{\link{gower_topn}} } gower/man/gower_topn.Rd0000644000176200001440000000276013053640755014637 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/gower.R \name{gower_topn} \alias{gower_topn} \title{Find the top-n matches} \usage{ gower_topn(x, y, pair_x = NULL, pair_y = NULL, n = 5, eps = 1e-08, nthread = getOption("gd_num_thread")) } \arguments{ \item{x}{\code{[data.frame]}} \item{y}{\code{[data.frame]}} \item{pair_x}{\code{[numeric|character] (optional)} Columns in \code{x} used for comparison. See Details below.} \item{pair_y}{\code{[numeric|character] (optional)} Columns in \code{y} used for comparison. See Details below.} \item{n}{The top-n indices and distances to return.} \item{eps}{\code{[numeric] (optional)} Computed numbers (variable ranges) smaller than \code{eps} are treated as zero.} \item{nthread}{Number of threads to use for parallelization. By default, for a dual-core machine, 2 threads are used. For any other machine n-1 cores are used so your machine doesn't freeze during a big computation. The maximum nr of threads are determined from \code{omp::get_max_threads}.} } \value{ A \code{list} with two array elements: \code{index} and \code{distance}. Both have size \code{n X nrow(x)}. Each ith column corresponds to the top-n best matches of \code{x} with rows in \code{y}. } \description{ Find the top-n matches in \code{y} for each record in \code{x}. } \examples{ # find the top 4 best matches in the iris data set with itself. x <- iris[1:3,] lookup <- iris[1:10,] gower_topn(x=x,y=lookup,n=4) } \seealso{ \code{\link{gower_dist}} } gower/man/gower-package.Rd0000644000176200001440000000040213053640755015157 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/gower-pkg.R \docType{package} \name{gower-package} \alias{gower-package} \title{Gower's distance/similarity measure.} \description{ A C-based implementation of Gower's distance. }