stringdist/ 0000755 0001762 0000144 00000000000 14116405002 012435 5 ustar ligges users stringdist/NAMESPACE 0000644 0001762 0000144 00000000711 14116401512 013655 0 ustar ligges users # Generated by roxygen2: do not edit by hand export(afind) export(ain) export(amatch) export(extract) export(grab) export(grabl) export(phonetic) export(printable_ascii) export(qgrams) export(seq_ain) export(seq_amatch) export(seq_dist) export(seq_distmatrix) export(seq_qgrams) export(seq_sim) export(stringdist) export(stringdistmatrix) export(stringsim) export(stringsimmatrix) importFrom(parallel,detectCores) useDynLib(stringdist, .registration=TRUE) stringdist/README.md 0000644 0001762 0000144 00000005154 13737347222 013742 0 ustar ligges users [](https://github.com/SNStatComp/awesome-official-statistics-software) ## stringdist * Approximate matching and string distance calculations for R. * All distance and matching operations are system- and encoding-independent. * Built for speed, using [openMP](https://www.openmp.org/) for parallel computing. The package offers the following main functions: * `stringdist` computes pairwise distances between two input character vectors (shorter one is recycled) * `stringdistmatrix` computes the distance matrix for one or two vectors * `stringsim` computes a string similarity between 0 and 1, based on `stringdist` * `amatch` is a fuzzy matching equivalent of R's native `match` function * `ain` is a fuzzy matching equivalent of R's native `%in%` operator * `seq_dist`, `seq_distmatrix`, `seq_amatch` and `seq_ain` for distances between, and matching of integer sequences. These functions are built upon `C`-code that re-implements some common (weighted) string distance functions. Distance functions include: * Hamming distance; * Levenshtein distance (weighted) * Restricted Damerau-Levenshtein distance (weighted, a.k.a. 
Optimal String Alignment) * Full Damerau-Levenshtein distance * Longest Common Substring distance * Q-gram distance * cosine distance for q-gram count vectors (= 1-cosine similarity) * Jaccard distance for q-gram count vectors (= 1-Jaccard similarity) * Jaro, and Jaro-Winkler distance * Soundex-based string distance Also, there are some utility functions: * `qgrams()` tabulates the qgrams in one or more `character` vectors. * `seq_qgrams()` tabulates the qgrams (sometimes called ngrams) in one or more `integer` vectors. * `phonetic()` computes phonetic codes of strings (currently only soundex) * `printable_ascii()` is a utility function that detects non-printable ascii or non-ascii characters. #### C API Some of `stringdist`'s underlying `C` functions can be called directly from `C` code in other packages. The description of the API can be found by either typing `?stringdist_api` in the R console or opening the vignette directly as follows: ``` vignette("stringdist_C-Cpp_api", package="stringdist") ``` Examples of packages that link to `stringdist` can be found [here](https://github.com/markvanderloo/linkstringdist) and [here](https://github.com/ChrisMuir/refinr). #### Resources * A [paper](https://journal.r-project.org/archive/2014-1/loo.pdf) on stringdist has been published in the R Journal * [Slides](https://www.markvanderloo.eu/files/statistics/stringdist_useR2014.pdf) of a talk given at the _useR!2014_ conference. stringdist/man/ 0000755 0001762 0000144 00000000000 13703403264 013221 5 ustar ligges users stringdist/man/stringdist-encoding.Rd 0000644 0001762 0000144 00000006550 13703403264 017474 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/doc_encoding.R \name{stringdist-encoding} \alias{stringdist-encoding} \title{String metrics in \pkg{stringdist}} \description{ This page gives an overview of encoding handling in \pkg{stringdist}. 
} \section{Encoding in \pkg{stringdist}}{ All character strings are stored as a sequence of bytes. An encoding system relates a byte, or a short sequence of bytes to a symbol. Over the years, many encoding systems have been developed, and not all operating systems and software packages use the same encoding by default. Similarly, depending on the system R is running on, R may use a different encoding for storing strings internally. The \pkg{stringdist} package is designed so users in principle need not worry about this. Strings are converted to \code{UTF-32} (unsigned integer) by default prior to any further computation. This means that results are encoding-independent and that strings are interpreted as a sequence of symbols, not as a sequence of pure bytes. In functions where this is relevant, this may be switched by setting the \code{useBytes} option to \code{TRUE}. However, keep in mind that results will then likely depend on the system R is running on, except when your strings are pure ASCII. Also, for multi-byte encodings, results for byte-wise computations will usually differ from results using encoded computations. Prior to \pkg{stringdist} version 0.9, setting \code{useBytes=TRUE} could give a significant performance enhancement. Since version 0.9, translation to integer is done by C code internal to \pkg{stringdist} and the difference in performance is now negligible. } \section{Unicode normalisation}{ In \code{utf-8}, the same (accented) character may be represented as several byte sequences. For example, a u-umlaut can be represented with a single byte code or as a byte code representing \code{'u'} followed by a modifier byte code that adds the umlaut. The \href{https://cran.r-project.org/package=stringi}{stringi} package of Gagolewski and Tartanus offers Unicode normalisation tools. } \section{Some tips on character encoding and transliteration}{ Some algorithms (like soundex) are defined only on the printable ASCII character set. 
This excludes any character with accents for example. Translating accented characters to the non-accented ones is a form of transliteration. On many systems running R (but not all!) you can achieve this with \code{iconv(x,to="ASCII//TRANSLIT")}, where \code{x} is your character vector. See the documentation of \code{\link[base]{iconv}} for details. The \code{stringi} package (Gagolewski and Tartanus) should work on any system. The command \code{stringi::stri_trans_general(x,"Latin-ASCII")} transliterates character vector \code{x} to ASCII. } \references{ \itemize{ \item{The help page of \code{\link[base]{Encoding}}} describes how R handles encoding. \item{The help page of \code{\link[base]{iconv}} has a good overview of base R's encoding conversion options. The capabilities of \code{iconv} depend on the system R is running on. The \pkg{stringi} package offers platform-independent encoding and normalization tools.} } } \seealso{ \itemize{ \item{Functions using re-encoding: \code{\link{stringdist}}, \code{\link{stringdistmatrix}}, \code{\link{amatch}}, \code{\link{ain}}, \code{\link{qgrams}}} \item{Encoding related: \code{\link{printable_ascii}}} } } stringdist/man/seq_amatch.Rd 0000644 0001762 0000144 00000010472 13703554555 015633 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/amatch.R \name{seq_amatch} \alias{seq_amatch} \alias{seq_ain} \title{Approximate matching for integer sequences.} \usage{ seq_amatch( x, table, nomatch = NA_integer_, matchNA = TRUE, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"), weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0.1, q = 1, p = 0, bt = 0, nthread = getOption("sd_num_thread") ) seq_ain(x, table, ...) } \arguments{ \item{x}{(\code{list} of) \code{integer} or \code{numeric} vector(s) to be approximately matched. 
Will be converted with \code{as.integer}.} \item{table}{(\code{list} of) \code{integer} or \code{numeric} vector(s) serving as lookup table for matching. Will be converted with \code{as.integer}.} \item{nomatch}{The value to be returned when no match is found. This is coerced to integer.} \item{matchNA}{Should \code{NA}'s be matched? Default behaviour mimics the behaviour of base \code{\link[base]{match}}, meaning that \code{NA} matches \code{NA}. With \code{NA}, we mean a missing entry in the \code{list}, represented as \code{NA_integer_}. If one of the integer sequences stored in the list has an \code{NA} entry, this is just treated as another integer (the representation of \code{NA_integer_}).} \item{method}{Matching algorithm to use. See \code{\link{stringdist-metrics}}.} \item{weight}{For \code{method='osa'} or \code{'dl'}, the penalty for deletion, insertion, substitution and transposition, in that order. When \code{method='lv'}, the penalty for transposition is ignored. When \code{method='jw'}, the weights associated with integers in elements of \code{a}, integers in elements of \code{b} and the transposition weight, in that order. Weights must be positive and not exceed 1. \code{weight} is ignored completely when \code{method='hamming'}, \code{'qgram'}, \code{'cosine'}, \code{'jaccard'}, or \code{'lcs'}.} \item{maxDist}{Elements in \code{x} will not be matched with elements of \code{table} if their distance is larger than \code{maxDist}. Note that the maximum distance between sequences depends on the method: it should always be specified.} \item{q}{q-gram size, only when method is \code{'qgram'}, \code{'jaccard'}, or \code{'cosine'}.} \item{p}{Winkler's prefix parameter for Jaro-Winkler distance, with \eqn{0\leq p\leq0.25}. Only when method is \code{'jw'}.} \item{bt}{Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than \code{bt}. 
Applies only to \code{method='jw'} and \code{p>0}.} \item{nthread}{Number of threads used by the underlying C-code. A sensible default is chosen, see \code{\link{stringdist-parallelization}}.} \item{...}{parameters to pass to \code{seq_amatch} (except \code{nomatch})} } \value{ \code{seq_amatch} returns the position of the closest match of \code{x} in \code{table}. When multiple matches with the same minimal distance metric exist, the first one is returned. \code{seq_ain} returns a \code{logical} vector of length \code{length(x)} indicating whether an element of \code{x} approximately matches an element in \code{table}. } \description{ For a \code{list} of integer vectors \code{x}, find the closest matches in a \code{list} of integer or numeric vectors in \code{table}. } \section{Notes}{ \code{seq_ain} is currently defined as \code{seq_ain <- function(x,table,...) seq_amatch(x, table, nomatch=0,...) > 0} All input vectors are converted with \code{as.integer}. This causes truncation for numeric vectors (e.g. \code{pi} will be treated as \code{3L}). } \examples{ x <- list(1:3,c(3:1),c(1L,3L,4L)) table <- list( c(5L,3L,1L,2L) ,1:4 ) seq_amatch(x,table,maxDist=2) # behaviour with missings seq_amatch(list(c(1L,NA_integer_,3L),NA_integer_), list(1:3),maxDist=1) \dontrun{ # Match sentences based on word order. Note: words must match exactly or they # are treated as completely different. # # For this example you need to have the 'hashr' package installed. 
x <- "Mary had a little lamb" x.words <- strsplit(x,"[[:blank:]]+") x.int <- hashr::hash(x.words) table <- c("a little lamb had Mary", "had Mary a little lamb") table.int <- hashr::hash(strsplit(table,"[[:blank:]]+")) seq_amatch(x.int,table.int,maxDist=3) } } \seealso{ \code{\link{seq_dist}}, \code{\link{seq_sim}}, \code{\link{seq_qgrams}} } stringdist/man/stringdist-package.Rd 0000644 0001762 0000144 00000005652 13703403264 017303 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/stringdist.R \docType{package} \name{stringdist-package} \alias{stringdist-package} \title{A package for string distance calculation and approximate string matching.} \description{ The \pkg{stringdist} package offers fast and platform-independent string metrics. Its main purpose is to compute various string distances and to do approximate text matching between character vectors. As of version 0.9.3, it is also possible to compute distances between sequences represented by integer vectors. } \details{ A typical use is to match strings that are not precisely the same. For example \code{ amatch(c("hello","g'day"),c("hi","hallo","ola"),maxDist=2)} returns \code{c(2,NA)} since \code{"hello"} matches closest with \code{"hallo"}, and within the maximum (optimal string alignment) distance. The second element, \code{"g'day"}, matches closest with \code{"ola"} but since the distance equals 4, no match is reported. A second typical use is to compute string distances. For example \code{ stringdist(c("g'day"),c("hi","hallo","ola"))} Returns \code{c(5,5,4)} since these are the distances between \code{"g'day"} and respectively \code{"hi"}, \code{"hallo"}, and \code{"ola"}. A third typical use would be to compute a \code{dist} object. The command \code{stringdistmatrix(c("foo","bar","boo","baz"))} returns an object of class \code{dist} that can be used by clustering algorithms such as \code{stats::hclust}. 
A fourth use is to compute string distances between general sequences, represented as integer vectors (which must be stored in a \code{list}): \code{seq_dist( list(c(1L,1L,2L)), list(c(1L,2L,1L),c(2L,3L,1L,2L)) )} The above code yields the vector \code{c(1,2)} (the shorter first argument is recycled over the longer second argument). Besides documentation for each function, the main topics documented are: \itemize{ \item{\code{\link{stringdist-metrics}} -- string metrics supported by the package} \item{\code{\link{stringdist-encoding}} -- how encoding is handled by the package} \item{\code{\link{stringdist-parallelization}} -- on multithreading } } } \section{Acknowledgements}{ \itemize{ \item{The code for the full Damerau-Levenshtein distance was adapted from Nick Logan's \href{https://github.com/ugexe/Text--Levenshtein--Damerau--XS/blob/master/damerau-int.c}{public github repository}.} \item{C code for converting UTF-8 to integer was copied from the R core for performance reasons.} \item{The code for soundex conversion and string similarity was kindly contributed by Jan van der Laan.} } } \section{Citation}{ If you would like to cite this package, please cite the \href{https://journal.r-project.org/archive/2014-1/loo.pdf}{R Journal Paper}: \itemize{ \item{M.P.J. van der Loo (2014). The \code{stringdist} package for approximate string matching. R Journal 6(1) pp 111-122} } Or use \code{citation('stringdist')} to get a bibtex item. } stringdist/man/seq_sim.Rd 0000644 0001762 0000144 00000002714 13703403264 015154 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/stringsim.R \name{seq_sim} \alias{seq_sim} \title{Compute similarity scores between sequences of integers} \usage{ seq_sim( a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"), q = 1, ... ) } \arguments{ \item{a}{\code{list} of \code{integer} vectors (target)} \item{b}{\code{list} of \code{integer} vectors (source). 
Optional for \code{seq_distmatrix}.} \item{method}{Method for distance calculation. The default is \code{"osa"}, see \code{\link{stringdist-metrics}}.} \item{q}{Size of the \eqn{q}-gram; must be nonnegative. Only applies to \code{method='qgram'}, \code{'jaccard'} or \code{'cosine'}.} \item{...}{additional arguments are passed on to \code{\link{seq_dist}}.} } \value{ A \code{numeric} vector of length \code{max(length(a),length(b))}. If one of the entries in \code{a} or \code{b} is \code{NA_integer_}, all comparisons with that element result in \code{NA}. Missings occurring within the sequences are treated as an ordinary number (the representation of \code{NA_integer_}). } \description{ Compute similarity scores between sequences of integers } \examples{ L1 <- list(1:3,2:4) L2 <- list(1:3) seq_sim(L1,L2,method="osa") # note how missing values are handled (L2 is recycled over L1) L1 <- list(c(1L,NA_integer_,3L),2:4,NA_integer_) L2 <- list(1:3) seq_sim(L1,L2) } \seealso{ \code{\link{seq_dist}}, \code{\link{seq_amatch}} } stringdist/man/seq_dist.Rd 0000644 0001762 0000144 00000011030 13703554555 015330 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/seqdist.R \name{seq_dist} \alias{seq_dist} \alias{seq_distmatrix} \title{Compute distance metrics between integer sequences} \usage{ seq_dist( a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"), weight = c(d = 1, i = 1, s = 1, t = 1), q = 1, p = 0, bt = 0, nthread = getOption("sd_num_thread") ) seq_distmatrix( a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"), weight = c(d = 1, i = 1, s = 1, t = 1), q = 1, p = 0, bt = 0, useNames = c("names", "none"), nthread = getOption("sd_num_thread") ) } \arguments{ \item{a}{(\code{list} of) \code{integer} or \code{numeric} vector(s). Will be converted with \code{as.integer} (target)} \item{b}{(\code{list} of) \code{integer} or \code{numeric} vector(s). 
Will be converted with \code{as.integer} (source). Optional for \code{seq_distmatrix}.} \item{method}{Distance metric. See \code{\link{stringdist-metrics}}} \item{weight}{For \code{method='osa'} or \code{'dl'}, the penalty for deletion, insertion, substitution and transposition, in that order. When \code{method='lv'}, the penalty for transposition is ignored. When \code{method='jw'}, the weights associated with characters of \code{a}, characters from \code{b} and the transposition weight, in that order. Weights must be positive and not exceed 1. \code{weight} is ignored completely when \code{method='hamming'}, \code{'qgram'}, \code{'cosine'}, \code{'jaccard'}, or \code{'lcs'}} \item{q}{Size of the \eqn{q}-gram; must be nonnegative. Only applies to \code{method='qgram'}, \code{'jaccard'} or \code{'cosine'}.} \item{p}{Prefix factor for Jaro-Winkler distance. The valid range for \code{p} is \code{0 <= p <= 0.25}. If \code{p=0} (default), the Jaro-distance is returned. Applies only to \code{method='jw'}.} \item{bt}{Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than \code{bt}. Applies only to \code{method='jw'} and \code{p>0}.} \item{nthread}{Maximum number of threads to use. By default, a sensible number of threads is chosen, see \code{\link{stringdist-parallelization}}.} \item{useNames}{label the output matrix with \code{names(a)} and \code{names(b)}?} } \value{ \code{seq_dist} returns a numeric vector with pairwise distances between \code{a} and \code{b} of length \code{max(length(a),length(b))}. For \code{seq_distmatrix} there are two options. If \code{b} is missing, the \code{\link[stats]{dist}} object corresponding to the \code{length(a) X length(a)} distance matrix is returned. If \code{b} is specified, the \code{length(a) X length(b)} distance matrix is returned. If any element of \code{a} or \code{b} is \code{NA_integer_}, the distance with any matched integer vector will result in \code{NA}. 
Missing values in the sequences themselves are treated as a number and not treated specially (also see the examples). } \description{ \code{seq_dist} computes pairwise string distances between elements of \code{a} and \code{b}, where the argument with fewer elements is recycled. \code{seq_distmatrix} computes the distance matrix with rows according to \code{a} and columns according to \code{b}. } \section{Notes}{ Input vectors are converted with \code{as.integer}. This causes truncation for numeric vectors (e.g. \code{pi} will be treated as \code{3L}). } \examples{ # Distances between lists of integer vectors. Note the postfix 'L' to force # integer storage. The shorter argument (a) is recycled. a <- list(c(102L, 107L)) # fk b <- list(c(102L,111L,111L),c(102L,111L,111L)) # foo, foo seq_dist(a,b) # translate strings to a list of integer sequences a <- lapply(c("foo","bar","baz"),utf8ToInt) seq_distmatrix(a) # Note how missing values are treated. NA's as part of the sequence are treated # as an integer (the representation of NA_integer_). a <- list(NA_integer_,c(102L, 107L)) b <- list(c(102L,111L,111L),c(102L,111L,NA_integer_)) seq_dist(a,b) \dontrun{ # Distance between sentences based on word order. Note: words must match exactly or they # are treated as completely different. # # For this example you need to have the 'hashr' package installed. 
a <- "Mary had a little lamb" a.words <- strsplit(a,"[[:blank:]]+") a.int <- hashr::hash(a.words) b <- c("a little lamb had Mary", "had Mary a little lamb") b.int <- hashr::hash(strsplit(b,"[[:blank:]]+")) seq_dist(a.int,b.int) } } \seealso{ \code{\link{seq_sim}}, \code{\link{seq_amatch}}, \code{\link{seq_qgrams}} } stringdist/man/qgrams.Rd 0000644 0001762 0000144 00000004653 13703403264 015012 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/qgrams.R \name{qgrams} \alias{qgrams} \title{Get a table of qgram counts from one or more character vectors.} \usage{ qgrams(..., .list = NULL, q = 1L, useBytes = FALSE, useNames = !useBytes) } \arguments{ \item{...}{any number of (named) arguments, that will be coerced to character with \code{as.character}.} \item{.list}{Will be concatenated with the \code{...} argument(s). Useful for adding character vectors named \code{'q'} or \code{'useNames'}.} \item{q}{size of q-gram, must be non-negative.} \item{useBytes}{Determine byte-wise qgrams. \code{useBytes=TRUE} is faster but may yield different results depending on character encoding. For \code{ASCII} it is identical. See also \code{\link{stringdist}} under Encoding issues.} \item{useNames}{Add q-grams as column names. If \code{useBytes=useNames=TRUE}, the q-byte sequences are represented as 2 hexadecimal numbers per byte, separated by a vertical bar (\code{|}).} } \value{ A table with \eqn{q}-gram counts. Detected \eqn{q}-grams are column names and the argument names as row names. If no argument names were provided, they will be generated. } \description{ Get a table of qgram counts from one or more character vectors. } \section{Details}{ The input is converted to \code{character}. If \code{useBytes=FALSE}, each element is converted to \code{utf8} and then to \code{integer} as in \code{\link{stringdist}}. Next, the data is passed to the underlying routine. 
Strings with fewer than \code{q} characters and elements containing \code{NA} are skipped. Using \code{q=0} therefore counts the number of empty strings \code{""} occurring in each argument. } \examples{ qgrams('hello world',q=3) # q-grams are counted uniquely over a character vector qgrams(rep('hello world',2),q=3) # to count them separately, do something like x <- c('hello', 'world') lapply(x,qgrams, q=3) # output rows may be named, and you can pass any number of character vectors x <- "I will not buy this record, it is scratched" y <- "My hovercraft is full of eels" z <- c("this", "is", "a", "dead","parrot") qgrams(A = x, B = y, C = z,q=2) # a tongue twister, showing the effects of useBytes and useNames x <- "peter piper picked a peck of pickled peppers" qgrams(x, q=2) qgrams(x, q=2, useNames=FALSE) qgrams(x, q=2, useBytes=TRUE) qgrams(x, q=2, useBytes=TRUE, useNames=TRUE) } \seealso{ \code{\link{stringdist}}, \code{\link{amatch}} } stringdist/man/stringdist.Rd 0000644 0001762 0000144 00000014275 13703554555 015725 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/stringdist.R \name{stringdist} \alias{stringdist} \alias{stringdistmatrix} \title{Compute distance metrics between strings} \usage{ stringdist( a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), q = 1, p = 0, bt = 0, nthread = getOption("sd_num_thread") ) stringdistmatrix( a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), q = 1, p = 0, bt = 0, useNames = c("none", "strings", "names"), nthread = getOption("sd_num_thread") ) } \arguments{ \item{a}{R object (target); will be converted by \code{as.character}} \item{b}{R object (source); will be converted by \code{as.character}. This argument is optional for \code{stringdistmatrix} (see section 
\code{Value}).} \item{method}{Method for distance calculation. The default is \code{"osa"}, see \code{\link{stringdist-metrics}}.} \item{useBytes}{Perform byte-wise comparison, see \code{\link{stringdist-encoding}}.} \item{weight}{For \code{method='osa'} or \code{'dl'}, the penalty for deletion, insertion, substitution and transposition, in that order. When \code{method='lv'}, the penalty for transposition is ignored. When \code{method='jw'}, the weights associated with characters of \code{a}, characters from \code{b} and the transposition weight, in that order. Weights must be positive and not exceed 1. \code{weight} is ignored completely when \code{method='hamming'}, \code{'qgram'}, \code{'cosine'}, \code{'jaccard'}, \code{'lcs'}, or \code{'soundex'}.} \item{q}{Size of the \eqn{q}-gram; must be nonnegative. Only applies to \code{method='qgram'}, \code{'jaccard'} or \code{'cosine'}.} \item{p}{Prefix factor for Jaro-Winkler distance. The valid range for \code{p} is \code{0 <= p <= 0.25}. If \code{p=0} (default), the Jaro-distance is returned. Applies only to \code{method='jw'}.} \item{bt}{Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than \code{bt}. Applies only to \code{method='jw'} and \code{p>0}.} \item{nthread}{Maximum number of threads to use. By default, a sensible number of threads is chosen, see \code{\link{stringdist-parallelization}}.} \item{useNames}{Use input vectors as row and column names?} } \value{ For \code{stringdist}, a vector with string distances of size \code{max(length(a),length(b))}. For \code{stringdistmatrix}: if both \code{a} and \code{b} are passed, a \code{length(a)xlength(b)} \code{matrix}. If a single argument \code{a} is given, an object of class \code{\link[stats]{dist}} is returned. 
Distances are nonnegative if they can be computed, \code{NA} if either of the two argument strings is \code{NA} and \code{Inf} when \code{maxDist} is exceeded or, in case of the hamming distance, when the two compared strings have different length. } \description{ \code{stringdist} computes pairwise string distances between elements of \code{a} and \code{b}, where the argument with fewer elements is recycled. \code{stringdistmatrix} computes the string distance matrix with rows according to \code{a} and columns according to \code{b}. } \examples{ # Simple example using optimal string alignment stringdist("ca","abc") # computing a 'dist' object d <- stringdistmatrix(c('foo','bar','boo','baz')) # try plot(hclust(d)) # The following gives a matrix stringdistmatrix(c("foo","bar","boo"),c("baz","buz")) # An example using Damerau-Levenshtein distance (multiple editing of substrings allowed) stringdist("ca","abc",method="dl") # string distance matching is case sensitive: stringdist("ABC","abc") # so you may want to normalize a bit: stringdist(tolower("ABC"),"abc") # stringdist recycles the shortest argument: stringdist(c('a','b','c'),c('a','c')) # stringdistmatrix gives the distance matrix (by default for optimal string alignment): stringdistmatrix(c('a','b','c'),c('a','c')) # different edit operations may be weighted; e.g. 
weighted transposition: stringdist('ab','ba',weight=c(1,1,1,0.5)) # Non-unit weights for insertion and deletion make the distance metric asymmetric stringdist('ca','abc') stringdist('abc','ca') stringdist('ca','abc',weight=c(0.5,1,1,1)) stringdist('abc','ca',weight=c(0.5,1,1,1)) # Hamming distance is undefined for # strings of unequal lengths so stringdist returns Inf stringdist("ab","abc",method="h") # For strings of equal length it counts the number of unequal characters as they occur # in the strings from beginning to end stringdist("hello","HeLl0",method="h") # The lcs (longest common substring) distance returns the number of # characters that are not part of the lcs. # # Here, the lcs is either 'a' or 'b' and one character cannot be paired: stringdist('ab','ba',method="lcs") # Here the lcs is 'surey' and 'v', 'g' and one 'r' of 'surgery' are not paired stringdist('survey','surgery',method="lcs") # q-grams are based on the difference between occurrences of q consecutive characters # in string a and string b. # Since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0: stringdist('abc','cba',method='qgram',q=1) # since the first string consists of 'ab','bc' and the second # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common): stringdist('abc','cba',method='qgram',q=2) # Wikipedia has the following example of the Jaro-distance. stringdist('MARTHA','MATHRA',method='jw') # Note that stringdist gives a _distance_ where wikipedia gives the corresponding # _similarity measure_. To get the wikipedia result: 1 - stringdist('MARTHA','MATHRA',method='jw') # The corresponding Jaro-Winkler distance can be computed by setting p=0.1 stringdist('MARTHA','MATHRA',method='jw',p=0.1) # or, as a similarity measure 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1) # This gives distance 1 since Euler and Gauss translate to different soundex codes. 
stringdist('Euler','Gauss',method='soundex') # Euler and Ellery translate to the same code and have distance 0 stringdist('Euler','Ellery',method='soundex') } \seealso{ \code{\link{stringsim}}, \code{\link{qgrams}}, \code{\link{amatch}}, \code{\link{afind}} } stringdist/man/phonetic.Rd 0000644 0001762 0000144 00000003666 13740013343 015330 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/phonetic.R \name{phonetic} \alias{phonetic} \title{Phonetic algorithms} \usage{ phonetic(x, method = c("soundex"), useBytes = FALSE) } \arguments{ \item{x}{a character vector whose elements are phonetically encoded.} \item{method}{name of the algorithm used. The default is \code{"soundex"}.} \item{useBytes}{Perform byte-wise comparison. \code{useBytes=TRUE} is faster but may yield different results depending on character encoding. For more information see the documentation of \code{\link{stringdist}}.} } \value{ The return value depends on the method used. However, all currently implemented methods return a character vector of the same length as the input vector. Output characters are in the system's native encoding. } \description{ Translate strings to phonetic codes. Similar sounding strings should get similar or equal codes. } \details{ Currently, only the soundex algorithm is implemented. Note that soundex coding is only meaningful for characters in the ranges a-z and A-Z. Soundex coding of strings containing non-printable ascii or non-ascii characters may be system-dependent and should not be trusted. If non-ascii or non-printable ascii characters are encountered, a warning is emitted. } \examples{ # The following examples are from The Art of Computer Programming (part III, p. 395) # (Note that our algorithm is specified differently from the one in TACP, see references.) 
phonetic(c('Euler','Gauss','Hilbert','Knuth','Lloyd','Lukasiewicz','Wachs'),method='soundex') } \references{ \itemize{ \item{The Soundex algorithm implemented is the algorithm used by the \href{https://www.archives.gov/research/census/soundex}{National Archives}. This algorithm differs slightly from the original algorithm patented by R.C. Russell (US patents 1261167 (1918) and 1435663 (1922)). } } } \seealso{ \code{\link{printable_ascii}}, \code{\link{stringdist-package}} } stringdist/man/stringsim.Rd 0000644 0001762 0000144 00000005460 13703403264 015534 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/stringsim.R \name{stringsim} \alias{stringsim} \alias{stringsimmatrix} \title{Compute similarity scores between strings} \usage{ stringsim( a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, q = 1, ... ) stringsimmatrix( a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, q = 1, ... ) } \arguments{ \item{a}{R object (target); will be converted by \code{as.character}.} \item{b}{R object (source); will be converted by \code{as.character}.} \item{method}{Method for distance calculation. The default is \code{"osa"}, see \code{\link{stringdist-metrics}}.} \item{useBytes}{Perform byte-wise comparison, see \code{\link{stringdist-encoding}}.} \item{q}{Size of the \eqn{q}-gram; must be nonnegative. Only applies to \code{method='qgram'}, \code{'jaccard'} or \code{'cosine'}.} \item{...}{additional arguments are passed on to \code{\link{stringdist}} and \code{\link{stringdistmatrix}} respectively.} } \value{ \code{stringsim} returns a vector with similarities, which are values between 0 and 1 where 1 corresponds to perfect similarity (distance 0) and 0 to complete dissimilarity. \code{NA} is returned when \code{\link{stringdist}} returns \code{NA}. 
Distances equal to \code{Inf} are truncated to a similarity of 0. \code{stringsimmatrix} works the same way but, analogous to \code{\link{stringdistmatrix}}, returns a similarity matrix instead of a vector. } \description{ \code{stringsim} computes pairwise string similarities between elements of \code{character} vectors \code{a} and \code{b}, where the vector with fewer elements is recycled. \code{stringsimmatrix} computes the string similarity matrix with rows according to \code{a} and columns according to \code{b}. } \details{ The similarity is computed by first calculating the distance using \code{\link{stringdist}}, dividing the distance by the maximum possible distance, and subtracting the result from 1. This results in a score between 0 and 1, with 1 corresponding to complete similarity and 0 to complete dissimilarity. Note that complete similarity only means equality for distances satisfying the identity property. This is not the case e.g. for q-gram based distances (for example if q=1, anagrams are completely similar). For distances where weights can be specified, the maximum distance is currently computed by assuming that all weights are equal to 1. 
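The normalisation described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name lv_sim is hypothetical, it uses the unit-weight Levenshtein distance from utils::adist, and it assumes that the maximum possible distance for that metric is the length of the longer string.

```r
# Hypothetical helper (not part of stringdist): similarity derived from the
# unit-weight Levenshtein distance, following the recipe 1 - d / max_d.
lv_sim <- function(a, b) {
  d <- drop(utils::adist(a, b))        # Levenshtein distance between a and b
  max_d <- max(nchar(a), nchar(b))     # assumed upper bound: longer string length
  if (max_d == 0) return(1)            # two empty strings are identical
  1 - d / max_d
}

lv_sim("ca", "abc")          # distance 3, bound 3, so similarity 0
lv_sim("kitten", "sitting")  # distance 3, bound 7, so similarity 4/7
```

The sketch handles one pair of strings at a time; the package functions are vectorised and support the other metrics as well.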
} \examples{ # Calculate the similarity using the default method of optimal string alignment stringsim("ca", "abc") # Calculate the similarity using the Jaro-Winkler method # The p argument is passed on to stringdist stringsim('MARTHA','MATHRA',method='jw', p=0.1) } stringdist/man/amatch.Rd 0000644 0001762 0000144 00000012476 13703554555 014771 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/amatch.R \name{amatch} \alias{amatch} \alias{ain} \title{Approximate string matching} \usage{ amatch( x, table, nomatch = NA_integer_, matchNA = TRUE, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0.1, q = 1, p = 0, bt = 0, nthread = getOption("sd_num_thread") ) ain(x, table, ...) } \arguments{ \item{x}{elements to be approximately matched: will be coerced to \code{character} unless it is a list consisting of \code{integer} vectors.} \item{table}{lookup table for matching. Will be coerced to \code{character} unless it is a list consisting of \code{integer} vectors.} \item{nomatch}{The value to be returned when no match is found. This is coerced to integer.} \item{matchNA}{Should \code{NA}'s be matched? Default behaviour mimics the behaviour of base \code{\link[base]{match}}, meaning that \code{NA} matches \code{NA} (see also the note on \code{NA} handling below).} \item{method}{Matching algorithm to use. See \code{\link{stringdist-metrics}}.} \item{useBytes}{Perform byte-wise comparison. See \code{\link{stringdist-encoding}}.} \item{weight}{For \code{method='osa'} or \code{'dl'}, the penalty for deletion, insertion, substitution and transposition, in that order. When \code{method='lv'}, the penalty for transposition is ignored. When \code{method='jw'}, the weights associated with characters of \code{a}, characters from \code{b} and the transposition weight, in that order. Weights must be positive and not exceed 1. 
\code{weight} is ignored completely when \code{method='hamming'}, \code{'qgram'}, \code{'cosine'}, \code{'jaccard'}, \code{'lcs'}, or \code{'soundex'}.} \item{maxDist}{Elements in \code{x} will not be matched with elements of \code{table} if their distance is larger than \code{maxDist}. Note that the maximum distance between strings depends on the method: it should always be specified.} \item{q}{q-gram size, only when method is \code{'qgram'}, \code{'jaccard'}, or \code{'cosine'}.} \item{p}{Winkler's 'prefix' parameter for Jaro-Winkler distance, with \eqn{0\leq p\leq0.25}. Only when method is \code{'jw'}} \item{bt}{Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than \code{bt}. Applies only to \code{method='jw'} and \code{p>0}.} \item{nthread}{Number of threads used by the underlying C-code. A sensible default is chosen, see \code{\link{stringdist-parallelization}}.} \item{...}{parameters to pass to \code{amatch} (except \code{nomatch})} } \value{ \code{amatch} returns the position of the closest match of \code{x} in \code{table}. When multiple matches with the same smallest distance metric exist, the first one is returned. \code{ain} returns a \code{logical} vector of length \code{length(x)} indicating whether an element of \code{x} approximately matches an element in \code{table}. } \description{ Approximate string matching equivalents of \code{R}'s native \code{\link[base]{match}} and \code{\%in\%}. } \details{ \code{ain} is currently defined as \code{ain <- function(x,table,...) amatch(x, table, nomatch=0,...) > 0} } \section{Note on \code{NA} handling}{ \code{R}'s native \code{\link[base]{match}} function matches \code{NA} with \code{NA}. This may feel inconsistent with \code{R}'s usual \code{NA} handling, since for example \code{NA==NA} yields \code{NA} rather than \code{TRUE}. 
In most cases, one may reason about the behaviour under \code{NA} along the lines of ``if one of the arguments is \code{NA}, the result shall be \code{NA}'', simply because not all information necessary to execute the function is available. One uses special functions such as \code{is.na}, \code{is.null} \emph{etc.} to handle special values. The \code{amatch} function mimics the behaviour of \code{\link[base]{match}} by default: \code{NA} is matched with \code{NA} and with nothing else. Note that this is inconsistent with the behaviour of \code{\link{stringdist}} since \code{stringdist} yields \code{NA} when at least one of the arguments is \code{NA}. The same inconsistency exists between \code{\link[base]{match}} and \code{\link[utils]{adist}}. In \code{amatch} this behaviour can be controlled by setting \code{matchNA=FALSE}. In that case, if any of the arguments in \code{x} is \code{NA}, the \code{nomatch} value is returned, regardless of whether \code{NA} is present in \code{table}. In \code{\link[base]{match}} the behaviour can be controlled by setting the \code{incomparables} option. 
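The base R behaviour referred to in this note can be checked directly. The sketch below uses only base R for the stated results; the final block calls amatch and is guarded, so it only runs when the stringdist package is installed, and its results are not asserted here.

```r
# Comparison propagates NA, but match() pairs NA with NA.
na_equal   <- NA == NA                                   # NA
na_matched <- match(NA, c("a", NA))                      # 2: NA found at position 2
na_blocked <- match(NA, c("a", NA), incomparables = NA)  # NA: NA is never matched

# The corresponding switch in amatch is matchNA (guarded: needs stringdist).
if (requireNamespace("stringdist", quietly = TRUE)) {
  print(stringdist::amatch(NA_character_, c("a", NA)))                   # default matchNA = TRUE
  print(stringdist::amatch(NA_character_, c("a", NA), matchNA = FALSE))  # nomatch is returned
}
```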
} \examples{ # lets see which sci-fi heroes are stringdistantly nearest amatch("leia",c("uhura","leela"),maxDist=5) # we can restrict the search amatch("leia",c("uhura","leela"),maxDist=1) # we can match each value in the find vector against values in the lookup table: amatch(c("leia","uhura"),c("ripley","leela","scully","trinity"),maxDist=2) # setting nomatch returns a different value when no match is found amatch("leia",c("uhura","leela"),maxDist=1,nomatch=0) # this is always true if maxDist is Inf ain("leia",c("uhura","leela"),maxDist=Inf) # Let's look in a neighbourhood of maximum 2 typo's (by default, the OSA algorithm is used) ain("leia",c("uhura","leela"), maxDist=2) } \seealso{ Other matching: \code{\link{afind}()} } \concept{matching} stringdist/man/stringdist-parallelization.Rd 0000644 0001762 0000144 00000004610 13703403264 021073 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/doc_parallel.R \name{stringdist-parallelization} \alias{stringdist-parallelization} \title{Multithreading and parallelization in \pkg{stringdist}} \description{ This page describes how \pkg{stringdist} uses parallel processing. } \section{Multithreading and parallelization in \pkg{stringdist}}{ The core functions of \pkg{stringdist} are implemented in C. On systems where \code{openMP} is available, \pkg{stringdist} will automatically take advantage of multiple cores. The \href{https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support}{section on OpenMP} of the \href{https://cran.r-project.org/doc/manuals/r-release/R-exts.html}{Writing R Extensions} manual discusses on what systems OpenMP is available (at the time of writing more or less anywhere except on OSX). By default, the number of threads to use is taken from \code{options('sd_num_thread')}. 
When the package is loaded, the value for this option is determined as follows: \itemize{ \item{If the environment variable \code{OMP_NUM_THREADS} is set, this value is taken.} \item{Otherwise, the number of available cores is determined with \code{parallel::detectCores()}. If this fails, the number of threads is set to 1 (with a message). If the number of detected cores exceeds three, the number of used cores is set to \eqn{n-1}.} \item{If available, the environment variable \code{OMP_THREAD_LIMIT} is read and the number of threads is set to the lesser of \code{OMP_THREAD_LIMIT} and the number of detected cores.} } The second step makes sure that on machines with \eqn{n>3} cores, \eqn{n-1} cores are used. Some benchmarking showed that using all cores is often slower in such cases. This is probably because at least one of the threads will be shared with the operating system. Functions that use multithreading have an option named \code{nthread} that controls the maximum number of threads to use. If you need to do large calculations, it is probably a good idea to benchmark the performance on your machine(s) as a function of \code{nthread}, for example using the \href{https://cran.r-project.org/package=microbenchmark}{microbenchmark} package of Mersmann. 
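The selection rules listed in this section can be written out as a small function. This is a sketch of one reading of those rules, not the code the package runs at load time: the name pick_threads and its arguments are hypothetical, and the thread limit is applied here to whatever count the earlier steps produced.

```r
# Hypothetical sketch of the documented rules; inputs are passed as arguments
# so the logic can be tried without changing the environment.
pick_threads <- function(omp_num   = Sys.getenv("OMP_NUM_THREADS"),
                         omp_limit = Sys.getenv("OMP_THREAD_LIMIT"),
                         cores     = parallel::detectCores()) {
  n <- suppressWarnings(as.integer(omp_num))
  if (is.na(n)) {
    # OMP_NUM_THREADS unset or unusable: fall back to the detected core count
    n <- if (is.na(cores)) 1L else as.integer(cores)
    if (n > 3L) n <- n - 1L              # leave one core free on larger machines
  }
  limit <- suppressWarnings(as.integer(omp_limit))
  if (!is.na(limit)) n <- min(n, limit)  # assumption: cap the result at the limit
  n
}

pick_threads(omp_num = "", omp_limit = "", cores = 8)   # 7
pick_threads(omp_num = "2", omp_limit = "", cores = 8)  # 2
```

Whatever value the option ends up with, it can be overridden per call through the nthread argument of the multithreaded functions.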
} \seealso{ \itemize{ \item{Functions running multithreaded: \code{\link{stringdist}}, \code{\link{stringdistmatrix}}, \code{\link{amatch}}, \code{\link{ain}} } } } stringdist/man/afind.Rd 0000644 0001762 0000144 00000012121 13704053015 014562 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/afind.R \name{afind} \alias{afind} \alias{grab} \alias{grabl} \alias{extract} \title{Stringdist-based fuzzy text search} \usage{ afind( x, pattern, window = NULL, value = TRUE, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), q = 1, p = 0, bt = 0, nthread = getOption("sd_num_thread") ) grab(x, pattern, maxDist = Inf, value = FALSE, ...) grabl(x, pattern, maxDist = Inf, ...) extract(x, pattern, maxDist = Inf, ...) } \arguments{ \item{x}{strings to search in} \item{pattern}{strings to find (not a regular expression). For \code{grab}, \code{grabl}, and \code{extract} this must be a single string.} \item{window}{width of moving window.} \item{value}{toggle return matrix with matched strings.} \item{method}{Matching algorithm to use. See \code{\link{stringdist-metrics}}.} \item{useBytes}{Perform byte-wise comparison. See \code{\link{stringdist-encoding}}.} \item{weight}{For \code{method='osa'} or \code{'dl'}, the penalty for deletion, insertion, substitution and transposition, in that order. When \code{method='lv'}, the penalty for transposition is ignored. When \code{method='jw'}, the weights associated with characters of \code{a}, characters from \code{b} and the transposition weight, in that order. Weights must be positive and not exceed 1. 
\code{weight} is ignored completely when \code{method='hamming'}, \code{'qgram'}, \code{'cosine'}, \code{'jaccard'}, \code{'lcs'}, or \code{'soundex'}.} \item{q}{q-gram size, only when method is \code{'qgram'}, \code{'jaccard'}, or \code{'cosine'}.} \item{p}{Winkler's 'prefix' parameter for Jaro-Winkler distance, with \eqn{0\leq p\leq0.25}. Only when method is \code{'jw'}} \item{bt}{Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than \code{bt}. Applies only to \code{method='jw'} and \code{p>0}.} \item{nthread}{Number of threads used by the underlying C-code. A sensible default is chosen, see \code{\link{stringdist-parallelization}}.} \item{maxDist}{Only windows with distance \code{<= maxDist} are considered a match.} \item{...}{passed to \code{afind}.} } \value{ For \code{afind}: a \code{list} of three matrices, each with \code{length(x)} rows and \code{length(pattern)} columns. In each matrix, element \eqn{(i,j)} corresponds to \code{x[i]} and \code{pattern[j]}. The names and descriptions of the matrices are as follows. \itemize{ \item{\code{location}. \code{[integer]}, location of the start of the best matching window. When \code{useBytes=FALSE}, this corresponds to the location of a \code{UTF} code point in \code{x}, possibly after conversion from its original encoding.} \item{\code{distance}. \code{[numeric]}, the string distance between pattern and the best matching window.} \item{\code{match}. \code{[character]}, the first, best matching window.} } For \code{grab}, an \code{integer} vector, indicating in which elements of \code{x} a match was found with a distance \code{<= maxDist}. When \code{value=TRUE}, the matched values are returned instead (equivalent to \code{\link[base]{grep}}). For \code{grabl}, a \code{logical} vector, indicating in which elements of \code{x} a match was found with a distance \code{<= maxDist} (equivalent to \code{\link[base:grep]{grepl}}). 
For \code{extract}, a \code{character} matrix with \code{length(x)} rows and \code{length(pattern)} columns. If a match was found, element \eqn{(i,j)} contains the match, otherwise it is set to \code{NA}. } \description{ \code{afind} slides a window of fixed width over a string \code{x} and computes the distance between each window and the sought-after \code{pattern}. The location, content, and distance corresponding to the window with the best match are returned. } \details{ Matching is case-sensitive. Both \code{x} and \code{pattern} are converted to \code{UTF-8} prior to search, unless \code{useBytes=TRUE}, in which case the distances are measured bytewise. Code is parallelized over the \code{x} variable: each value of \code{x} is scanned for every element in \code{pattern} using a separate thread (when \code{nthread} is larger than 1). The functions \code{grab} and \code{grabl} are approximate string matching functions that somewhat resemble base R's \code{\link[base]{grep}} and \code{\link[base:grep]{grepl}}. They are implemented as convenience wrappers of \code{afind}. } \section{Running cosine distance}{ This algorithm gains efficiency by exploiting the fact that two consecutive windows have a large overlap in their q-gram profiles. It gives the same result as the \code{"cosine"} distance, but is much faster. 
} \examples{ texts = c("When I grow up, I want to be" , "one of the harvesters of the sea" , "I think before my days are gone" , "I want to be a fisherman") patterns = c("fish", "gone","to be") afind(texts, patterns, method="running_cosine", q=3) grabl(texts,"grew", maxDist=1) extract(texts, "harvested", maxDist=3) } \seealso{ Other matching: \code{\link{amatch}()} } \concept{matching} stringdist/man/stringdist-api.Rd 0000644 0001762 0000144 00000000647 13703403264 016460 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/doc_api.R \name{stringdist_api} \alias{stringdist_api} \title{Calling stringdist from \code{C} or \code{C++}} \description{ As of version \code{0.9.5.0} several \code{C} level functions can be linked to and called from C code in other R packages. A description of the API can be found in \href{../doc/stringdist_api.pdf}{stringdist_api.pdf}. } stringdist/man/seq_qgrams.Rd 0000644 0001762 0000144 00000002055 13703403264 015654 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/qgrams.R \name{seq_qgrams} \alias{seq_qgrams} \title{Get a table of qgram counts for integer sequences} \usage{ seq_qgrams(..., .list = NULL, q = 1L) } \arguments{ \item{...}{Any number of (named) arguments that will be coerced with \code{as.integer}} \item{.list}{Will be concatenated with the \code{...} argument(s). Useful for adding integer vectors named 'q'.} \item{q}{The size of q-gramming.} } \value{ A \code{matrix} containing q-gram profiles. Columns 1 to \code{q} contain the encountered q-grams. The ensuing (named) columns contain the q-gram counts per vector. Run the example for a simple overview. Missing values in integer sequences are treated as any other number. } \description{ Get a table of qgram counts for integer sequences } \examples{ # compare the 2-gram overlap between sequences 1:3 and 2:4 seq_qgrams(x = 1:3, y=2:4,q=2) # behavior when NA's are present. 
seq_qgrams(1:3,c(1,NA,2),NA_integer_) } \seealso{ \code{\link{seq_dist}}, \code{\link{seq_amatch}} } stringdist/man/stringdist-metrics.Rd 0000644 0001762 0000144 00000020363 13737346674 017374 0 ustar ligges users % Generated by roxygen2: do not edit by hand % Please edit documentation in R/doc_metrics.R \name{stringdist-metrics} \alias{stringdist-metrics} \title{String metrics in \pkg{stringdist}} \description{ This page gives an overview of the string dissimilarity measures offered by \pkg{stringdist}. } \section{String Metrics}{ String metrics are ways of quantifying the dissimilarity between two finite sequences, usually text strings. Over the years, many such measures have been developed. Some are based on a mathematical understanding of the set of all strings that can be composed from a finite alphabet, others are based on more heuristic principles, such as how a text string sounds when pronounced by a native English speaker. The terms 'string metrics' and 'string distance' are used more or less interchangeably in the literature. From a mathematical point of view, string metrics often do not obey the demands that are usually required from a distance function. For example, it is not true for all string metrics that a distance of 0 means that two strings are the same (e.g. in the \eqn{q}-gram distance). Nevertheless, string metrics are very useful in practice and have many applications. The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typos, the Jaro-Winkler distance may be of use. If you are comparing names that were written down after hearing them, a phonetic distance may be a better choice. Currently, the following distance metrics are supported by \pkg{stringdist}. 
\tabular{ll}{ \bold{Method name} \tab \bold{Description}\cr \code{osa} \tab Optimal string alignment, (restricted Damerau-Levenshtein distance).\cr \code{lv} \tab Levenshtein distance (as in R's native \code{\link[utils]{adist}}).\cr \code{dl} \tab Full Damerau-Levenshtein distance.\cr \code{hamming} \tab Hamming distance (\code{a} and \code{b} must have same number of characters).\cr \code{lcs} \tab Longest common substring distance.\cr \code{qgram} \tab \eqn{q}-gram distance. \cr \code{cosine} \tab cosine distance between \eqn{q}-gram profiles \cr \code{jaccard} \tab Jaccard distance between \eqn{q}-gram profiles \cr \code{jw} \tab Jaro, or Jaro-Winkler distance.\cr \code{soundex} \tab Distance based on soundex encoding (see below) } } \section{A short description of string metrics supported by \pkg{stringdist}}{ See \href{https://journal.r-project.org/archive/2014-1/loo.pdf}{Van der Loo (2014)} for an extensive description and references. The review papers of Navarro (2001) and Boytsov (2011) provide excellent technical overviews of respectively online and offline string matching algorithms. The \bold{Hamming distance} (\code{method='hamming'}) counts the number of character substitutions that turn \code{b} into \code{a}. If \code{a} and \code{b} have a different number of characters the distance is \code{Inf}. The \bold{Levenshtein distance} (\code{method='lv'}) counts the number of deletions, insertions and substitutions necessary to turn \code{b} into \code{a}. This method is equivalent to \code{R}'s native \code{\link[utils]{adist}} function. The \bold{Optimal String Alignment distance} (\code{method='osa'}) is like the Levenshtein distance but also allows transposition of adjacent characters. Here, each substring may be edited only once. (For example, a character cannot be transposed twice to move it forward in the string). 
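To make the difference between the Levenshtein and optimal string alignment distances concrete, the following base R sketch computes the unit-weight OSA distance with the textbook dynamic programme. The function osa_dist is a hypothetical illustration and not the C implementation used by the package; it ignores weights, encodings and vectorisation.

```r
# Hypothetical illustration: unit-weight optimal string alignment distance.
osa_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)   # D[i + 1, j + 1]: distance between prefixes of length i and j
  D[, 1] <- 0:n
  D[1, ] <- 0:m
  for (i in seq_len(n)) for (j in seq_len(m)) {
    cost <- as.numeric(x[i] != y[j])
    D[i + 1, j + 1] <- min(D[i, j + 1] + 1,    # deletion
                           D[i + 1, j] + 1,    # insertion
                           D[i, j] + cost)     # substitution or match
    if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
      # transposition of two adjacent characters
      D[i + 1, j + 1] <- min(D[i + 1, j + 1], D[i - 1, j - 1] + 1)
    }
  }
  D[n + 1, m + 1]
}

osa_dist("ab", "ba")          # 1: a single transposition
drop(utils::adist("ab", "ba"))  # 2: Levenshtein needs two substitutions
osa_dist("ca", "abc")         # 3: the transposed pair may not be edited again
```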
The \bold{full Damerau-Levenshtein distance} (\code{method='dl'}) is like the optimal string alignment distance except that it allows for multiple edits on substrings. The \bold{longest common substring} (\code{method='lcs'}) is defined as the longest string that can be obtained by pairing characters from \code{a} and \code{b} while keeping the order of characters intact. The \bold{lcs-distance} is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one. A \bold{\eqn{q}-gram} (\code{method='qgram'}) is a subsequence of \eqn{q} \emph{consecutive} characters of a string. If \eqn{x} (\eqn{y}) is the vector of counts of \eqn{q}-gram occurrences in \code{a} (\code{b}), the \bold{\eqn{q}-gram distance} is given by the sum over the absolute differences \eqn{|x_i-y_i|}. The computation is aborted when \code{q} is larger than the length of any of the strings. In that case \code{Inf} is returned. The \bold{cosine distance} (\code{method='cosine'}) is computed as \eqn{1-x\cdot y/(\|x\|\|y\|)}, where \eqn{x} and \eqn{y} were defined above. Let \eqn{X} be the set of unique \eqn{q}-grams in \code{a} and \eqn{Y} the set of unique \eqn{q}-grams in \code{b}. The \bold{Jaccard distance} (\code{method='jaccard'}) is given by \eqn{1-|X\cap Y|/|X\cup Y|}. The \bold{Jaro distance} (\code{method='jw'}, \code{p=0}), is a number between 0 (exact match) and 1 (completely dissimilar) measuring dissimilarity between strings. It is defined to be 0 when both strings have length 0, and 1 when there are no character matches between \code{a} and \code{b}. Otherwise, the Jaro distance is defined as \eqn{1-(1/3)(w_1m/|a| + w_2m/|b| + w_3(m-t)/m)}. Here, \eqn{|a|} indicates the number of characters in \code{a}, \eqn{m} is the number of character matches and \eqn{t} the number of transpositions of matching characters. 
The \eqn{w_i} are weights associated with the characters in \code{a}, characters in \code{b} and with transpositions. A character \eqn{c} of \code{a} \emph{matches} a character from \code{b} when \eqn{c} occurs in \code{b}, and the index of \eqn{c} in \code{a} differs less than \eqn{\max(|a|,|b|)/2 -1} (where we use integer division) from the index of \eqn{c} in \code{b}. Two matching characters are transposed when they are matched but they occur in different order in string \code{a} and \code{b}. The \bold{Jaro-Winkler distance} (\code{method=jw}, \code{0