A radix tree, or trie, is a data structure for storing key-value pairs in a way that is optimised for searching. This makes tries very good at efficiently matching data against keys and retrieving the values associated with those keys.
triebeard provides an implementation of tries for R (and one that can be used in Rcpp development, too, if that’s your thing) so that useRs can take advantage of the fast, efficient and user-friendly matching that they allow.
Suppose we have observations in a dataset that are labelled with a 2-3 letter code identifying the facility the sample came from:
labels <- c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
"AO-1099", "AFT-1101", "QZ-4933")
We know the facility each code maps to, and we want to be able to map the labels to that - not over 10 entries but over hundreds, or thousands, or hundreds of thousands. Tries are a great way of doing that: we treat the codes as keys and the full facility names as values. So let’s make a trie to do this matching, and then, well, match:
library(triebeard)
trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))
longest_match(trie = trie, to_match = labels)
[1] "Audobon" "Atlanta" "Ann Arbor" "Austin" "Queensland" "Queensland" "Raleigh" "Audobon" "Austin"
[10] "Queensland"
This pulls out, for each label, the trie value where the associated key has the longest prefix-match to the label. We can also just grab all the values where the key starts with, say, A:
prefix_match(trie = trie, to_match = "A")
[[1]]
[1] "Ann Arbor" "Atlanta" "Austin" "Audobon"
And finally if we want we can match very, very fuzzily using “greedy” matching:
greedy_match(trie = trie, to_match = "AO")
[[1]]
[1] "Ann Arbor" "Atlanta" "Austin" "Audobon"
These operations are very, very efficient. Taking longest-match as an example, since that’s the most useful thing, here is a benchmark with a one-million element vector of things to match against:
library(triebeard)
library(microbenchmark)
trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))
labels <- rep(c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
"AO-1099", "AFT-1101", "QZ-4933"), 100000)
microbenchmark({longest_match(trie = trie, to_match = labels)})
Unit: milliseconds
                                              expr      min       lq     mean   median       uq      max neval
 { longest_match(trie = trie, to_match = labels) } 284.6457 285.5902 289.5342 286.8775 288.4564 327.3878   100
I think we can call <300 milliseconds for a million matches against an entire set of possible values pretty fast.
There’s always the possibility that (horror of horrors) you’ll have to add or remove entries from the trie. Fear not; you can do just that with trie_add and trie_remove respectively, both of which silently modify the trie they’re provided with to add or remove whatever key-value pairs you provide:
to_match <- "198.0.0.1"
trie_inst <- trie(keys = "197", values = "fake range")
longest_match(trie_inst, to_match)
[1] NA
trie_add(trie_inst, keys = "198", values = "home range")
longest_match(trie_inst, to_match)
[1] "home range"
trie_remove(trie_inst, keys = "198")
longest_match(trie_inst, to_match)
[1] NA
You can also extract information from tries without using them. dim, str, print and length all work for tries, and you can use get_keys(trie) and get_values(trie) to extract, respectively, the keys and values from a trie object.
In addition, you can coerce tries into other R data structures, specifically lists and data.frames:
trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))
str(as.data.frame(trie))
'data.frame': 6 obs. of 2 variables:
$ keys : chr "AAI" "AEO" "AFT" "AO" ...
$ values: chr "Ann Arbor" "Atlanta" "Austin" "Audobon" ...
str(as.list(trie))
List of 2
$ keys : chr [1:6] "AAI" "AEO" "AFT" "AO" ...
$ values: chr [1:6] "Ann Arbor" "Atlanta" "Austin" "Audobon" ...
If you have ideas for other trie-like structures, or functions that would be useful with these tries, the best approach is to either request it or add it!
A radix tree is a data structure for storing key-value pairs in a way that is optimised for searching. This makes radix trees very good at efficiently matching data against keys and retrieving the values associated with those keys.
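To make the idea concrete, here is a minimal sketch of the underlying structure. It is an illustration only and not triebeard's implementation: SimpleTrie and TrieNode are hypothetical names, and the sketch is a plain character trie, whereas a radix tree proper also merges chains of single-child nodes into one edge.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical illustration, not triebeard's code: one node per character,
// so keys that share a prefix share the nodes for that prefix.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool has_value = false;   // true when a key ends at this node
    std::string value;
};

class SimpleTrie {
public:
    // Walk (and create) one node per character, then store the value at the end.
    void insert(const std::string& key, const std::string& value) {
        TrieNode* node = &root_;
        for (char c : key) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->has_value = true;
        node->value = value;
    }

    // Exact lookup: returns a pointer to the stored value, or nullptr if absent.
    const std::string* find(const std::string& key) const {
        const TrieNode* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node->has_value ? &node->value : nullptr;
    }

private:
    TrieNode root_;
};
```

A lookup visits at most one node per character of the key, however many keys are stored, which is where the fast matching comes from.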
triebeard provides an implementation of radix trees for Rcpp (and also for use directly in R). To start using radix trees in your Rcpp development, simply modify your C++ file to include at the top:
//[[Rcpp::depends(triebeard)]]
#include <radix.h>
Trees are constructed using the syntax:
radix_tree<type1, type2> radix;
Where type1 represents the type of the keys (for example, std::string) and type2 the type of the values.
Radix trees can have any scalar type as keys, although strings are most typical; they can also have any scalar type for values. Once you’ve constructed a tree, new entries can be added in a very R-like way: radix[new_key] = new_value;. Entries can also be removed, with radix.erase(key).
We then move on to the fun bit: matching! As mentioned, radix trees are really good for matching arbitrary values against keys (well, keys of the same type) and retrieving the associated values.
There are three types of supported matching: longest, prefix, and greedy. Longest does exactly what it says on the tin: it finds the key-value pair whose key is the longest one that forms an initial part of the value being matched:
radix_tree<std::string, std::string> radix;
radix["turnin"] = "entry the first";
radix["turin"] = "entry the second";
radix_tree<std::string, std::string>::iterator it;
// longest_match returns an iterator to the matching entry, or radix.end() if there is none
it = radix.longest_match("turing");
if(it == radix.end()){ // compare with ==; a single = would assign instead of test
    printf("No match was found :(");
} else {
    std::string result = "Key of longest match: " + it->first + " , value of longest match: " + it->second;
}
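If you want to check what longest matching should return without pulling in the package, the semantics can be sketched with a plain std::map. This is a stand-in for illustration, not how the radix tree does it internally; longest_prefix_value is a hypothetical helper name, and returning an empty string for "no match" is an assumption of this sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper, not part of triebeard: returns the value whose key is
// the longest prefix of `query`, or an empty string when no key is a prefix.
std::string longest_prefix_value(const std::map<std::string, std::string>& table,
                                 const std::string& query) {
    // Try progressively shorter prefixes of the query, longest first,
    // and stop at the first one that is stored as a key.
    for (std::size_t len = query.size(); len > 0; --len) {
        auto it = table.find(query.substr(0, len));
        if (it != table.end()) return it->second;
    }
    return std::string();
}
```

With the two entries above, matching "turing" picks the "turin" entry, since "turin" is a prefix of "turing" and "turnin" is not. The map version does one lookup per prefix length, which is the work a radix tree avoids by walking the key once.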
Prefix matching provides all trie entries where the value-to-match is a prefix of the key:
radix_tree<std::string, std::string> radix;
radix["turnin"] = "entry the first";
radix["turin"] = "entry the second";
std::vector<radix_tree<std::string, std::string>::iterator> vec;
std::vector<radix_tree<std::string, std::string>::iterator>::iterator it;
// prefix_match fills vec with one tree iterator per matching entry
radix.prefix_match("tur", vec);
if(vec.empty()){
    printf("No match was found :(");
} else {
    for (it = vec.begin(); it != vec.end(); ++it) {
        // it points at a tree iterator, so dereference it before reading the entry
        std::string result = "Key of a prefix match: " + (*it)->first + " , value of a prefix match: " + (*it)->second;
    }
}
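The same kind of stand-in works for prefix matching. The sketch below uses a sorted std::map rather than the radix tree, and values_with_prefix is a hypothetical helper name; it relies on the fact that all keys sharing a prefix sit next to each other in sorted order.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not part of triebeard: collects the values of all
// entries whose key starts with `prefix`, in sorted key order.
std::vector<std::string> values_with_prefix(const std::map<std::string, std::string>& table,
                                            const std::string& prefix) {
    std::vector<std::string> out;
    // lower_bound finds the first key >= prefix; keys starting with the
    // prefix follow contiguously, so stop at the first key that does not.
    for (auto it = table.lower_bound(prefix);
         it != table.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

Called with the prefix "tur" on the two entries above, this returns both values; called with "turn", only the value stored under "turnin".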
Greedy matching matches very, very fuzzily (a value of ‘bring’, for example, will match ‘blind’, ‘bind’ and ‘binary’) and, syntactically, looks exactly the same as prefix-matching, albeit with radix.greedy_match() instead of radix.prefix_match().