<!-- File: urltools/inst/doc/urltools.Rmd -->
## Elegant URL handling with urltools
URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist
to create connections to retrieve datasets. This is an essential feature for the language to have,
but it also means that URL handlers are designed for situations where URLs *get* you to the data -
not situations where URLs *are* the data.
There is no support for encoding or decoding URLs en masse, and no support for parsing and
interpreting them. `urltools` provides this support!
### URL encoding and decoding
Base R provides two functions - `URLdecode` and `URLencode` - for taking percent-encoded
URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to
enable connections, and so they have several inherent limitations, including a lack of vectorisation, that
make them unsuitable for large datasets.
Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations:
`URLdecode`, for example, breaks if the decoded value is out of range:
```{r, eval=FALSE}
URLdecode("test%gIL")
Error in rawToChar(out) : embedded nul in string: '\0L'
In addition: Warning message:
In URLdecode("%gIL") : out-of-range values treated as 0 in coercion to raw
```
`URLencode`, on the other hand, encodes slashes on its most strict setting - without
paying attention to where those slashes *are*: if we attempt to `URLencode` an entire URL, we get:
```{r, eval=FALSE}
URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
```
That's a completely unusable URL (or ewRL, if you will).
`urltools` replaces both functions with `url_decode` and `url_encode` respectively:
```{r, eval=FALSE}
library(urltools)
url_decode("test%gIL")
[1] "test"
url_encode("https://en.wikipedia.org/wiki/Article")
[1] "https://en.wikipedia.org%2fwiki%2fArticle"
```
As you can see, `url_decode` simply excludes out-of-range characters from consideration, while `url_encode`
detects characters that make up part of the URL's scheme, and leaves them unencoded. Both are extremely fast;
with `urltools`, you can decode a vector of 1,000,000 URLs in 0.9 seconds.
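If you want to check that claim on your own hardware, here is a minimal benchmark sketch - the example URL is illustrative, and exact timings will of course vary by machine:
```{r, eval=FALSE}
# One million percent-encoded URLs, decoded in a single vectorised call
urls <- rep("https://en.wikipedia.org/wiki/File%3AExample.jpg", 1000000)
system.time(url_decode(urls))
```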
Alongside these, we have functions for encoding and decoding the 'punycode' format of URLs - ones that are designed to be internationalised and contain Unicode characters. These also take one argument, a vector of URLs, and can be found at `puny_encode` and `puny_decode` respectively.
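For example (the expected values here come from the package's own punycode tests; the `xn--` prefix marks an encoded label):
```{r, eval=FALSE}
puny_encode("https://www.bücher.com/foo")
# [1] "https://www.xn--bcher-kva.com/foo"
puny_decode("https://www.xn--bcher-kva.com/foo")
# [1] "https://www.bücher.com/foo"
```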
### URL parsing
Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time,
you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path,
but not the entire thing as one string.
The solution is `url_parse`, which takes a URL and breaks it out into its [RFC 3986](http://www.ietf.org/rfc/rfc3986.txt) components: scheme, domain, port, path, query string and fragment identifier. This is,
again, fully vectorised, and can happily be run over hundreds of thousands of URLs, rapidly processing them. The
results are provided as a data.frame, since most people use data.frames to store data.
```{r, eval=FALSE}
> parsed_address <- url_parse("https://en.wikipedia.org/wiki/Article")
> str(parsed_address)
'data.frame': 1 obs. of 6 variables:
$ scheme : chr "https"
$ domain : chr "en.wikipedia.org"
$ port : chr NA
$ path : chr "wiki/Article"
$ parameter: chr NA
$ fragment : chr NA
```
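Because it is vectorised, a vector of URLs simply produces one row per URL - a quick sketch:
```{r, eval=FALSE}
urls <- c("https://en.wikipedia.org/wiki/Article",
          "https://en.wikipedia.org:4000/wiki/api.php")
nrow(url_parse(urls))
# [1] 2
```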
We can also perform the opposite of this operation with `url_compose`:
```{r, eval=FALSE}
> url_compose(parsed_address)
[1] "https://en.wikipedia.org/wiki/article"
```
### Getting/setting URL components
With the inclusion of a URL parser, we suddenly have the opportunity for lubridate-style component getting
and setting. Syntax is identical to that of `lubridate`, but uses URL components as function names.
```{r, eval=FALSE}
url <- "https://en.wikipedia.org/wiki/Article"
scheme(url)
"https"
scheme(url) <- "ftp"
url
"ftp://en.wikipedia.org/wiki/Article"
```
Fields that can be extracted or set are `scheme`, `domain`, `port`, `path`, `parameters` and `fragment`.
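The same pattern applies to all of them; as a sketch, setting a component that the URL doesn't yet have should add it:
```{r, eval=FALSE}
url <- "https://en.wikipedia.org/wiki/Article"
fragment(url) <- "History"
url
# [1] "https://en.wikipedia.org/wiki/Article#History"
```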
### Suffix and TLD extraction
Once we've extracted a domain from a URL with `domain` or `url_parse`, we can identify which bit is the domain name, and which
bit is the suffix:
```{r, eval=FALSE}
> url <- "https://en.wikipedia.org/wiki/Article"
> domain_name <- domain(url)
> domain_name
[1] "en.wikipedia.org"
> str(suffix_extract(domain_name))
'data.frame': 1 obs. of 4 variables:
$ host : chr "en.wikipedia.org"
$ subdomain: chr "en"
$ domain : chr "wikipedia"
$ suffix : chr "org"
```
This relies on an internal database of public suffixes, accessible at `suffix_dataset` - we recognise, though,
that this dataset may get a bit out of date, so you can also pass the results of the `suffix_refresh` function,
which retrieves an updated dataset, to `suffix_extract`:
```{r, eval=FALSE}
domain_name <- domain("https://en.wikipedia.org/wiki/Article")
updated_suffixes <- suffix_refresh()
suffix_extract(domain_name, updated_suffixes)
```
We can do the same thing with top-level domains, with precisely the same setup, except the functions and datasets are `tld_refresh`, `tld_extract` and `tld_dataset`.
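As a sketch of the top-level-domain version - assuming, as with `suffix_extract`, a data.frame result:
```{r, eval=FALSE}
domain_name <- domain("https://en.wikipedia.org/wiki/Article")
tld_extract(domain_name)
# a one-row data.frame pairing the host, "en.wikipedia.org", with its TLD, "org"
```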
In the other direction we have `host_extract`, which retrieves, well, the host! If the URL has subdomains, it'll be the
lowest-level subdomain. If it doesn't, it'll be the actual domain name, without the suffixes:
```{r, eval=FALSE}
domain_name <- domain("https://en.wikipedia.org/wiki/Article")
host_extract(domain_name)
```
### Query manipulation
Once a URL is parsed, it's sometimes useful to get the value associated with a particular query parameter. As
an example, take the URL `http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json`. What
pageID is being used? What is the export format? We can find out with `param_get`.
```{r, eval=FALSE}
> str(param_get(urls = "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json",
parameter_names = c("pageid","export")))
'data.frame': 1 obs. of 2 variables:
$ pageid: chr "1023"
$ export: chr "json"
```
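If `parameter_names` is omitted, `param_get` extracts every key present. A sketch, using a made-up booking URL of the kind the package's tests use:
```{r, eval=FALSE}
str(param_get("http://www.example.com/geo?from=01/04/2015&guests=4&to=05/04/2015"))
# 'data.frame': 1 obs. of 3 variables:
#  $ from  : chr "01/04/2015"
#  $ guests: chr "4"
#  $ to    : chr "05/04/2015"
```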
This isn't the only function for query manipulation; we can also dynamically modify the values a particular parameter
might have, or strip them out entirely.
To modify the values, we use `param_set`:
```{r, eval=FALSE}
url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
url <- param_set(url, key = "pageid", value = "12")
url
# [1] "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=12&export=json"
```
As you can see, this works pretty well; it even works in situations where the URL doesn't *have* a query yet:
```{r, eval=FALSE}
url <- "http://en.wikipedia.org/wiki/api.php"
url <- param_set(url, key = "pageid", value = "12")
url
# [1] "http://en.wikipedia.org/wiki/api.php?pageid=12"
```
On the other hand, we might have a parameter we just don't want any more - that can be handled with `param_remove`, which can
take multiple keys as well as multiple URLs:
```{r, eval=FALSE}
url <- "http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023&export=json"
url <- param_remove(url, keys = c("action","export"))
url
# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
```
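`param_remove` is also vectorised over the URLs themselves; a sketch:
```{r, eval=FALSE}
urls <- c("http://en.wikipedia.org/wiki/api.php?action=parse&pageid=1023",
          "http://en.wikipedia.org/wiki/api.php?action=query&pageid=12")
param_remove(urls, keys = "action")
# [1] "http://en.wikipedia.org/wiki/api.php?pageid=1023"
# [2] "http://en.wikipedia.org/wiki/api.php?pageid=12"
```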
### Other URL handlers
If you have ideas for other URL handlers that would make your data processing easier, the best approach
is to either [request it](https://github.com/Ironholds/urltools/issues) or [add it](https://github.com/Ironholds/urltools/pulls)!
# File: urltools/tests/testthat.R
library(testthat)
library(urltools)
test_check("urltools")
# File: urltools/tests/testthat/test_parsing.R
context("URL parsing tests")
test_that("Check parsing identifies each RfC element", {
data <- url_parse("https://www.google.com:80/foo.php?api_params=turnip#ending")
expect_that(ncol(data), equals(6))
expect_that(names(data), equals(c("scheme","domain","port","path","parameter","fragment")))
expect_that(data$scheme[1], equals("https"))
expect_that(data$domain[1], equals("www.google.com"))
expect_that(data$port[1], equals("80"))
expect_that(data$path[1], equals("foo.php"))
expect_that(data$parameter[1], equals("api_params=turnip"))
expect_that(data$fragment[1], equals("ending"))
})
test_that("Check parsing can handle missing elements", {
data <- url_parse("https://www.google.com/foo.php?api_params=turnip#ending")
expect_that(ncol(data), equals(6))
expect_that(names(data), equals(c("scheme","domain","port","path","parameter","fragment")))
expect_that(data$scheme[1], equals("https"))
expect_that(data$domain[1], equals("www.google.com"))
expect_true(is.na(data$port[1]))
expect_that(data$path[1], equals("foo.php"))
expect_that(data$parameter[1], equals("api_params=turnip"))
expect_that(data$fragment[1], equals("ending"))
})
test_that("Parsing does not up and die and misplace the fragment",{
data <- url_parse("http://www.yeastgenome.org/locus/S000005366/overview#protein")
expect_that(data$fragment[1], equals("protein"))
})
test_that("Composing works",{
url <- c("http://foo.bar.baz/qux/", "https://en.wikipedia.org:4000/wiki/api.php")
amended_url <- url_compose(url_parse(url))
expect_that(url, equals(amended_url))
})
test_that("Port handling works", {
url <- "https://en.wikipedia.org:4000/wiki/api.php"
expect_that(port(url), equals("4000"))
expect_that(path(url), equals("wiki/api.php"))
url <- "https://en.wikipedia.org:4000"
expect_that(port(url), equals("4000"))
expect_true(is.na(path(url)))
url <- "https://en.wikipedia.org:4000/"
expect_that(port(url), equals("4000"))
expect_true(is.na(path(url)))
url <- "https://en.wikipedia.org:4000?foo=bar"
expect_that(port(url), equals("4000"))
expect_true(is.na(path(url)))
expect_that(parameters(url), equals("foo=bar"))
})
test_that("Port handling does not break path handling", {
url <- "https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg"
expect_true(is.na(port(url)))
expect_that(path(url), equals("wiki/File:Vice_City_Public_Radio_(logo).jpg"))
})
test_that("URLs with parameters but no paths work", {
url <- url_parse("http://www.nextpedition.com?inav=menu_travel_nextpedition")
expect_true(url$domain[1] == "www.nextpedition.com")
expect_true(is.na(url$port[1]))
expect_true(is.na(url$path[1]))
expect_true(url$parameter[1] == "inav=menu_travel_nextpedition")
})
test_that("URLs with user credentials drop said credentials when parsing", {
out <- urltools::url_parse("http://foo:bar@97.77.104.22:3128")
testthat::expect_identical(out$domain, "97.77.104.22")
})
test_that("IPv6 URLs can be handled", {
url <- url_parse("tcp://[2607:5300:61:44f::]:8333")
expect_true(url$domain[1] == "2607:5300:61:44f::")
expect_true(url$port[1] == "8333")
})
test_that("URLs with missing paths and parameters, but with fragments, work", {
url <- urltools::url_parse("http://some.website.com#frag")
expect_true(url$fragment[1] == "frag")
})
# File: urltools/tests/testthat/test_parameters.R
context("Test parameter manipulation")
test_that("Parameter parsing can handle multiple, non-existent and pre-trailing parameters",{
urls <- c("https://www.google.com:80/foo.php?api_params=parsable&this_parameter=selfreferencing&hiphop=awesome",
"https://www.google.com:80/foo.php?api_params=parsable&this_parameter=selfreferencing&hiphop=awesome#foob",
"https://www.google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome")
results <- param_get(urls, c("api_params","hiphop"))
expect_that(results[1:2,1], equals(c("parsable","parsable")))
expect_true(is.na(results[3,1]))
})
test_that("Parameter parsing works where the parameter appears earlier in the URL", {
url <- param_get("www.housetrip.es/tos-de-vacaciones/geo?from=01/04/2015&guests=4&to=05/04/2015","to")
expect_that(ncol(url), equals(1))
expect_that(url$to[1], equals("05/04/2015"))
})
test_that("Default argument will get all parameter keys", {
url <- param_get("www.housetrip.es/tos-de-vacaciones/geo?from=01/04/2015&guests=4&to=05/04/2015")
df <- data.frame("from"="01/04/2015", "guests"="4", "to"="05/04/2015", stringsAsFactors = FALSE)
testthat::expect_equivalent(url, df)
})
test_that("vectorized get all keys produces NA appropriately", {
urls <- c("www.housetrip.es/tos-de-vacaciones/geo?guests=4&to=05/04/2015",
"www.housetrip.es/tos-de-vacaciones/geo?from=01/04/2015&guests=8")
pars <- param_get(urls)
df <- data.frame(stringsAsFactors = FALSE,
from = c(NA, "01/04/2015"),
guests = c("4", "8"),
to = c("05/04/2015", NA))
expect_equivalent(pars, df)
})
test_that("parameter get deals with escaped ampersands and fragments in field values", {
par <- param_get("http://host/query?foo=bar&baz&=nonsense#a=1&b=2", "amp")
expect_equivalent(par, data.frame(amp="nonsense", stringsAsFactors=FALSE))
par <- param_get("http://host/query?foo=bar&baz&=nonsense#a=1&b=2")
expect_equivalent(par, data.frame(amp="nonsense", foo="bar&baz", stringsAsFactors=FALSE))
})
test_that("Setting parameter values works", {
expect_true(param_set("https://en.wikipedia.org/wiki/api.php", "baz", "quorn") ==
"https://en.wikipedia.org/wiki/api.php?baz=quorn")
expect_true(param_set("https://en.wikipedia.org/wiki/api.php?foo=bar&baz=qux", "baz", "quorn") ==
"https://en.wikipedia.org/wiki/api.php?foo=bar&baz=quorn")
expect_true(param_set("https://en.wikipedia.org/wiki/api.php?foo=bar", "baz", "quorn") ==
"https://en.wikipedia.org/wiki/api.php?foo=bar&baz=quorn")
})
test_that("Setting parameter values quietly fails with NA components", {
url <- "https://en.wikipedia.org/api.php?action=query"
expect_identical(url, param_set(url, "action", NA_character_))
expect_true(is.na(param_set(NA_character_, "action", "foo")))
expect_identical(url, param_set(url, NA_character_, "pageinfo"))
})
test_that("Setting parameter values works with partially-duplicative keys", {
url <- "https://en.wikipedia.org/api.php"
url <- param_set(url, "foo", "bar")
url <- param_set(url, "oo", "baz")
testthat::expect_equal(url,
"https://en.wikipedia.org/api.php?foo=bar&oo=baz")
})
test_that("Removing parameter entries quietly fails with NA components", {
url <- "https://en.wikipedia.org/api.php?action=query"
expect_identical(url, param_remove(url, "foo"))
expect_true(is.na(param_remove(NA_character_, "action")))
})
test_that("Removing parameter keys works", {
expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux", "baz") ==
"https://en.wikipedia.org/api.php")
})
test_that("Removing parameter keys works when there are multiple parameters in the URL", {
expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux&foo=bar", "baz") ==
"https://en.wikipedia.org/api.php?foo=bar")
})
test_that("Removing parameter keys works when there are multiple parameters to remove", {
expect_true(param_remove("https://en.wikipedia.org/api.php?baz=qux&foo=bar", c("baz","foo")) ==
"https://en.wikipedia.org/api.php")
})
test_that("Removing parameter keys works when there is no query", {
expect_true(param_remove("https://en.wikipedia.org/api.php", "baz") ==
"https://en.wikipedia.org/api.php")
})
# File: urltools/tests/testthat/test_encoding.R
context("URL encoding tests")
test_that("Check encoding doesn't encode the scheme", {
expect_that(url_encode("https://"), equals("https://"))
})
test_that("Check encoding does does not encode pre-path slashes", {
expect_that(url_encode("https://foo.org/bar/"), equals("https://foo.org/bar%2f"))
})
test_that("Check encoding can handle NAs", {
expect_that(url_encode(c("https://foo.org/bar/", NA)), equals(c("https://foo.org/bar%2f", NA)))
})
test_that("Check decoding can handle NAs", {
expect_that(url_decode(c("https://foo.org/bar%2f", NA)), equals(c("https://foo.org/bar/", NA)))
})
# Add comment for windows trickery
test_that("Check decoding and encoding are equivalent", {
if(.Platform$OS.type == "unix"){
url <- "Hinrichtung_auf_dem_Altst%c3%a4dter_Ring.JPG%2f120px-Hinrichtung_auf_dem_Altst%c3%a4dter_Ring.JPG"
decoded_url <- "Hinrichtung_auf_dem_Altstädter_Ring.JPG/120px-Hinrichtung_auf_dem_Altstädter_Ring.JPG"
expect_that((url_decode(url)), equals((decoded_url)))
expect_that((url_encode(decoded_url)), equals((url)))
}
})
# File: urltools/tests/testthat/test_suffixes.R
context("Test suffix extraction")
test_that("Suffix extraction works with simple domains",{
result <- suffix_extract("en.wikipedia.org")
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(1))
expect_that(result$subdomain[1], equals("en"))
expect_that(result$domain[1], equals("wikipedia"))
expect_that(result$suffix[1], equals("org"))
})
test_that("Suffix extraction works with multiple domains",{
result <- suffix_extract(c("en.wikipedia.org","en.wikipedia.org"))
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(2))
expect_that(result$subdomain[1], equals("en"))
expect_that(result$domain[1], equals("wikipedia"))
expect_that(result$suffix[1], equals("org"))
expect_that(result$subdomain[2], equals("en"))
expect_that(result$domain[2], equals("wikipedia"))
expect_that(result$suffix[2], equals("org"))
})
test_that("Suffix extraction works when the domain is the same as the suffix",{
result <- suffix_extract(c("googleapis.com", "myapi.googleapis.com"))
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(2))
expect_equal(result$subdomain[1], NA_character_)
expect_equal(result$domain[1], NA_character_)
expect_equal(result$suffix[1], "googleapis.com")
expect_equal(result$subdomain[2], NA_character_)
expect_equal(result$domain[2], "myapi")
expect_equal(result$suffix[2], "googleapis.com")
})
test_that("Suffix extraction works where domains/suffixes overlap", {
result <- suffix_extract(domain("http://www.converse.com")) # could be se.com or .com
expect_equal(result$subdomain[1], "www")
expect_equal(result$domain[1], "converse")
expect_equal(result$suffix[1], "com")
})
test_that("Suffix extraction works when the domain matches a wildcard suffix",{
result <- suffix_extract(c("banana.bd", "banana.boat.bd"))
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(2))
expect_equal(result$subdomain[1], NA_character_)
expect_equal(result$domain[1], NA_character_)
expect_equal(result$suffix[1], "banana.bd")
expect_equal(result$subdomain[2], NA_character_)
expect_equal(result$domain[2], "banana")
expect_equal(result$suffix[2], "boat.bd")
})
test_that("Suffix extraction works when the domain matches a wildcard suffix and has subdomains",{
result <- suffix_extract(c("foo.bar.banana.bd"))
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(1))
expect_equal(result$subdomain[1], "foo")
expect_equal(result$domain[1], "bar")
expect_equal(result$suffix[1], "banana.bd")
})
test_that("Suffix extraction works with new suffixes",{
result <- suffix_extract("en.wikipedia.org", suffix_refresh())
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(1))
expect_that(result$subdomain[1], equals("en"))
expect_that(result$domain[1], equals("wikipedia"))
expect_that(result$suffix[1], equals("org"))
})
test_that("Suffix extraction works with an arbitrary suffixes database (to ensure it is loading it)",{
result <- suffix_extract(c("is-this-a.bananaboat", "en.wikipedia.org"), data.frame(suffixes = "bananaboat"))
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(2))
expect_equal(result$subdomain[1], NA_character_)
expect_equal(result$domain[1], "is-this-a")
expect_equal(result$suffix[1], "bananaboat")
expect_equal(result$subdomain[2], NA_character_)
expect_equal(result$domain[2], NA_character_)
expect_equal(result$suffix[2], NA_character_)
})
test_that("Suffix extraction is back to normal using the internal database when it receives suffixes=NULL",{
result <- suffix_extract("en.wikipedia.org")
expect_that(ncol(result), equals(4))
expect_that(names(result), equals(c("host","subdomain","domain","suffix")))
expect_that(nrow(result), equals(1))
expect_that(result$subdomain[1], equals("en"))
expect_that(result$domain[1], equals("wikipedia"))
expect_that(result$suffix[1], equals("org"))
})
# File: urltools/tests/testthat/test_puny.R
context("Check punycode handling")
testthat::test_that("Simple punycode domain encoding works", {
testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com/foo")),
"https://www.xn--bcher-kva.com/foo")
})
testthat::test_that("Punycode domain encoding works with fragmentary paths", {
testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com/")),
"https://www.xn--bcher-kva.com/")
})
testthat::test_that("Punycode domain encoding works with ports", {
testthat::expect_identical(puny_encode(enc2utf8("https://www.b\u00FCcher.com:80")),
"https://www.xn--bcher-kva.com:80")
})
testthat::test_that("Punycode domain encoding returns an NA on NAs", {
testthat::expect_true(is.na(puny_encode(NA_character_)))
})
testthat::test_that("Simple punycode domain decoding works", {
testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com/foo"),
enc2utf8("https://www.b\u00FCcher.com/foo"))
})
testthat::test_that("Punycode domain decoding works with fragmentary paths", {
testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com/"),
enc2utf8("https://www.b\u00FCcher.com/"))
})
testthat::test_that("Punycode domain decoding works with ports", {
testthat::expect_identical(puny_decode("https://www.xn--bcher-kva.com:80"),
enc2utf8("https://www.b\u00FCcher.com:80"))
})
testthat::test_that("Punycode domain decoding returns an NA on NAs", {
testthat::expect_true(is.na(puny_decode(NA_character_)))
})
testthat::test_that("Punycode domain decoding returns an NA on invalid entries", {
testthat::expect_true(is.na(suppressWarnings(puny_decode("xn--9"))))
})
testthat::test_that("Punycode domain decoding warns on invalid entries", {
testthat::expect_warning(puny_decode("xn--9"))
})
# File: urltools/tests/testthat/test_get_set.R
context("Component get/set tests")
test_that("Check elements can be retrieved", {
url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
testthat::expect_equal(scheme(url), "https")
testthat::expect_equal(domain(url), "www.google.com")
testthat::expect_equal(port(url), "80")
testthat::expect_equal(path(url), "foo.php")
testthat::expect_equal(parameters(url), "api_params=turnip")
testthat::expect_equal(fragment(url), "ending")
})
test_that("Check elements can be retrieved with NAs", {
url <- as.character(NA)
testthat::expect_equal(is.na(scheme(url)), TRUE)
testthat::expect_equal(is.na(domain(url)), TRUE)
testthat::expect_equal(is.na(port(url)), TRUE)
testthat::expect_equal(is.na(path(url)), TRUE)
testthat::expect_equal(is.na(parameters(url)), TRUE)
testthat::expect_equal(is.na(fragment(url)), TRUE)
})
test_that("Check elements can be set", {
url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
scheme(url) <- "http"
testthat::expect_equal(scheme(url), "http")
domain(url) <- "www.wikipedia.org"
testthat::expect_equal(domain(url), "www.wikipedia.org")
port(url) <- "23"
testthat::expect_equal(port(url), "23")
path(url) <- "bar.php"
testthat::expect_equal(path(url), "bar.php")
parameters(url) <- "api_params=manic"
testthat::expect_equal(parameters(url), "api_params=manic")
fragment(url) <- "beginning"
testthat::expect_equal(fragment(url), "beginning")
})
test_that("Check elements can be set with NAs", {
url <- "https://www.google.com:80/"
scheme(url) <- "http"
testthat::expect_equal(scheme(url), "http")
domain(url) <- "www.wikipedia.org"
testthat::expect_equal(domain(url), "www.wikipedia.org")
port(url) <- "23"
testthat::expect_equal(port(url), "23")
path(url) <- "bar.php"
testthat::expect_equal(path(url), "bar.php")
parameters(url) <- "api_params=manic"
testthat::expect_equal(parameters(url), "api_params=manic")
fragment(url) <- "beginning"
testthat::expect_equal(fragment(url), "beginning")
})
test_that("Assigning NA with get will NA a URL", {
url <- "https://www.google.com:80/"
port(url) <- NA_character_
testthat::expect_true(is.na(url))
})
test_that("Removing components with a NULL works", {
url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
fragment(url) <- NULL
testthat::expect_equal(url,
"https://www.google.com:80/foo.php?api_params=turnip")
parameters(url) <- NULL
testthat::expect_equal(url,
"https://www.google.com:80/foo.php")
path(url) <- NULL
testthat::expect_equal(url,
"https://www.google.com:80")
port(url) <- NULL
testthat::expect_equal(url, "https://www.google.com")
})
test_that("Removing non-removable components throws an error", {
url <- "https://en.wikipedia.org/foo.php"
testthat::expect_error({
scheme(url) <- NULL
})
testthat::expect_error({
domain(url) <- NULL
})
})
test_that("Check multiple elements can be set", {
url <- "https://www.google.com:80/foo.php?api_params=turnip#ending"
url <- c(url, url)
scheme(url) <- c("http", "ftp")
testthat::expect_equal(scheme(url), c("http", "ftp"))
domain(url) <- c("www.wikipedia.org", "google.com")
testthat::expect_equal(domain(url), c("www.wikipedia.org",
"google.com"))
port(url) <- c("23", "86")
testthat::expect_equal(port(url), c("23", "86"))
path(url) <- c("bar.php", "baz.html")
testthat::expect_equal(path(url), c("bar.php", "baz.html"))
parameters(url) <- c("api_params=manic", "api_params=street")
testthat::expect_equal(parameters(url), c("api_params=manic", "api_params=street"))
fragment(url) <- c("beginning", "end")
testthat::expect_equal(fragment(url), c("beginning", "end"))
})
test_that("Check elements can be set with extraneous separators", {
url <- "https://www.wikipedia.org:80/bar.php?api_params=manic#beginning"
backup <- url
scheme(url) <- "https://"
testthat::expect_equal(url, backup)
port(url) <- ":80"
testthat::expect_equal(url, backup)
path(url) <- "/bar.php"
testthat::expect_equal(url, backup)
parameters(url) <- "?api_params=manic"
testthat::expect_equal(url, backup)
fragment(url) <- "#beginning"
testthat::expect_equal(url, backup)
})
# File: urltools/tests/testthat/test_credentials.R
testthat::context("Test credential extraction and getting")
testthat::test_that("Credentials can be stripped", {
testthat::expect_identical(strip_credentials("http://foo:bar@97.77.104.22:3128"), "http://97.77.104.22:3128")
})
testthat::test_that("Strings with invalidly-placed credentials are left alone", {
testthat::expect_identical(strip_credentials("htt@p://foo:bar97.77.104.22:3128"), "htt@p://foo:bar97.77.104.22:3128")
})
testthat::test_that("Invalid URLs are left alone", {
testthat::expect_identical(strip_credentials("foo:bar@97.77.104.22:3128"), "foo:bar@97.77.104.22:3128")
})
testthat::test_that("Non-objects are left alone", {
testthat::expect_true(is.na(strip_credentials(NA_character_)))
})
testthat::test_that("Credentials can be retrieved", {
data <- get_credentials("http://foo:bar@97.43.5421")
testthat::expect_identical(data$username, "foo")
testthat::expect_identical(data$authentication, "bar")
})
testthat::test_that("Strings with invalidly-placed credentials are left alone", {
data <- get_credentials("htt@p://foo:bar97.77.104.22:3128")
testthat::expect_true(is.na(data$username))
testthat::expect_true(is.na(data$authentication))
})
testthat::test_that("Invalid URLs are left alone", {
data <- get_credentials("foo:bar@97.77.104.22:3128")
testthat::expect_true(is.na(data$username))
testthat::expect_true(is.na(data$authentication))
})
testthat::test_that("Non-objects are left alone", {
data <- get_credentials(NA_character_)
testthat::expect_true(is.na(data$username))
testthat::expect_true(is.na(data$authentication))
})
# File: urltools/tests/testthat/test_memory.R
context("Avoid regressions around proxy objects")
test_that("Values are correctly disposed from memory",{
memfn <- function(d = NULL){
test_url <- "https://test.com"
if(!is.null(d)){
test_url <- urltools::param_set(test_url, "q" , urltools::url_encode(d))
}
return(test_url)
}
baseurl <- "https://test.com"
expect_equal(memfn(), baseurl)
expect_equal(memfn("blah"), paste0(baseurl, "?q=blah"))
expect_equal(memfn(), baseurl)
})
test_that("Parameters correctly add to output",{
outfn <- function(d = FALSE){
test_url <- "https://test.com"
if(d){
test_url <- urltools::param_set(test_url, "q", urltools::url_encode(d))
}
return(test_url)
}
baseurl <- "https://test.com"
expect_equal(outfn(), baseurl)
expect_equal(outfn(TRUE), paste0(baseurl, "?q=TRUE"))
})
// File: urltools/src/suffix.cpp
#include <Rcpp.h>
using namespace Rcpp;
std::string string_reverse(std::string x){
std::reverse(x.begin(), x.end());
return x;
}
//[[Rcpp::export]]
CharacterVector reverse_strings(CharacterVector strings){
unsigned int input_size = strings.size();
CharacterVector output(input_size);
for(unsigned int i = 0; i < input_size; i++){
if(strings[i] == NA_STRING){
output[i] = NA_STRING;
} else {
output[i] = string_reverse(Rcpp::as<std::string>(strings[i]));
}
}
return output;
}
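// finalise_suffixes() receives each host, its matched suffix and two flag
// vectors (wildcard suffix matches, and hosts that are themselves suffixes),
// and splits the remainder into subdomain/domain columns for the result.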
//[[Rcpp::export]]
DataFrame finalise_suffixes(CharacterVector full_domains, CharacterVector suffixes,
LogicalVector wildcard, LogicalVector is_suffix){
unsigned int input_size = full_domains.size();
CharacterVector subdomains(input_size);
CharacterVector domains(input_size);
std::string holding;
size_t domain_location;
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(is_suffix[i]){
subdomains[i] = NA_STRING;
domains[i] = NA_STRING;
suffixes[i] = full_domains[i];
} else {
if(suffixes[i] == NA_STRING || suffixes[i].size() == full_domains[i].size()){
subdomains[i] = NA_STRING;
domains[i] = NA_STRING;
} else if(wildcard[i]) {
holding = Rcpp::as<std::string>(full_domains[i]);
holding = holding.substr(0, ((full_domains[i].size() - suffixes[i].size()) - 1));
domain_location = holding.rfind(".");
if(domain_location == std::string::npos){
domains[i] = NA_STRING;
subdomains[i] = NA_STRING;
suffixes[i] = holding + "." + suffixes[i];
} else {
suffixes[i] = holding.substr(domain_location+1) + "." + suffixes[i];
holding = holding.substr(0, domain_location);
domain_location = holding.rfind(".");
if(domain_location == std::string::npos){
if(holding.size() == 0){
domains[i] = NA_STRING;
} else {
domains[i] = holding;
}
subdomains[i] = NA_STRING;
} else {
domains[i] = holding.substr(domain_location+1);
subdomains[i] = holding.substr(0, domain_location);
}
}
} else {
holding = Rcpp::as<std::string>(full_domains[i]);
holding = holding.substr(0, ((full_domains[i].size() - suffixes[i].size()) - 1));
domain_location = holding.rfind(".");
if(domain_location == std::string::npos){
subdomains[i] = NA_STRING;
if(holding.size() == 0){
domains[i] = NA_STRING;
} else {
domains[i] = holding;
}
} else {
subdomains[i] = holding.substr(0, domain_location);
domains[i] = holding.substr(domain_location+1);
}
}
}
}
return DataFrame::create(_["host"] = full_domains, _["subdomain"] = subdomains,
_["domain"] = domains, _["suffix"] = suffixes,
_["stringsAsFactors"] = false);
}
//[[Rcpp::export]]
CharacterVector tld_extract_(CharacterVector domains){
unsigned int input_size = domains.size();
CharacterVector output(input_size);
std::string holding;
size_t fragment_location;
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(domains[i] == NA_STRING){
output[i] = NA_STRING;
} else {
holding = Rcpp::as<std::string>(domains[i]);
fragment_location = holding.rfind(".");
if(fragment_location == std::string::npos || fragment_location == (holding.size() - 1)){
output[i] = NA_STRING;
} else {
output[i] = holding.substr(fragment_location+1);
}
}
}
return output;
}
//[[Rcpp::export]]
CharacterVector host_extract_(CharacterVector domains){
unsigned int input_size = domains.size();
CharacterVector output(input_size);
std::string holding;
size_t fragment_location;
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(domains[i] == NA_STRING){
output[i] = NA_STRING;
} else {
holding = Rcpp::as<std::string>(domains[i]);
fragment_location = holding.find(".");
if(fragment_location == std::string::npos){
output[i] = NA_STRING;
} else {
output[i] = holding.substr(0, fragment_location);
}
}
}
return output;
}
/* File: urltools/src/punycode.h */
/*
punycode.c from RFC 3492
http://www.nicemice.net/idn/
Adam M. Costello
http://www.nicemice.net/amc/
This is ANSI C code (C89) implementing Punycode (RFC 3492).
C. Disclaimer and license
Regarding this entire document or any portion of it (including
the pseudocode and C code), the author makes no guarantees and
is not responsible for any damage resulting from its use. The
author grants irrevocable permission to anyone to use, modify,
and distribute it in any way that does not diminish the rights
of anyone else to use, modify, and distribute it, provided that
redistributed derivative works do not contain misleading author or
version information. Derivative works need not be licensed under
similar terms.
*/
#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */
/************************************************************/
/* Public interface (would normally go in its own .h file): */
#include <limits.h>
enum punycode_status {
punycode_success,
punycode_bad_input, /* Input is invalid. */
punycode_big_output, /* Output would exceed the space provided. */
punycode_overflow /* Input needs wider integers to process. */
};
#if UINT_MAX >= (1 << 26) - 1
typedef unsigned int punycode_uint;
#else
typedef unsigned long punycode_uint;
#endif
enum punycode_status punycode_encode(
punycode_uint input_length,
const punycode_uint input[],
const unsigned char case_flags[],
punycode_uint *output_length,
char output[] );
/* punycode_encode() converts Unicode to Punycode. The input */
/* is represented as an array of Unicode code points (not code */
/* units; surrogate pairs are not allowed), and the output */
/* will be represented as an array of ASCII code points. The */
/* output string is *not* null-terminated; it will contain */
/* zeros if and only if the input contains zeros. (Of course */
/* the caller can leave room for a terminator and add one if */
/* needed.) The input_length is the number of code points in */
/* the input. The output_length is an in/out argument: the */
/* caller passes in the maximum number of code points that it */
/* can receive, and on successful return it will contain the */
/* number of code points actually output. The case_flags array */
/* holds input_length boolean values, where nonzero suggests that */
/* the corresponding Unicode character be forced to uppercase */
/* after being decoded (if possible), and zero suggests that */
/* it be forced to lowercase (if possible). ASCII code points */
/* are encoded literally, except that ASCII letters are forced */
/* to uppercase or lowercase according to the corresponding */
/* uppercase flags. If case_flags is a null pointer then ASCII */
/* letters are left as they are, and other code points are */
/* treated as if their uppercase flags were zero. The return */
/* value can be any of the punycode_status values defined above */
/* except punycode_bad_input; if not punycode_success, then */
/* output_size and output might contain garbage. */
enum punycode_status punycode_decode(
punycode_uint input_length,
const char input[],
punycode_uint *output_length,
punycode_uint output[],
unsigned char case_flags[] );
/* punycode_decode() converts Punycode to Unicode. The input is */
/* represented as an array of ASCII code points, and the output */
/* will be represented as an array of Unicode code points. The */
/* input_length is the number of code points in the input. The */
/* output_length is an in/out argument: the caller passes in */
/* the maximum number of code points that it can receive, and */
/* on successful return it will contain the actual number of */
/* code points output. The case_flags array needs room for at */
/* least output_length values, or it can be a null pointer if the */
/* case information is not needed. A nonzero flag suggests that */
/* the corresponding Unicode character be forced to uppercase */
/* by the caller (if possible), while zero suggests that it be */
/* forced to lowercase (if possible). ASCII code points are */
/* output already in the proper case, but their flags will be set */
/* appropriately so that applying the flags would be harmless. */
/* The return value can be any of the punycode_status values */
/* defined above; if not punycode_success, then output_length, */
/* output, and case_flags might contain garbage. On success, the */
/* decoder will never need to write an output_length greater than */
/* input_length, because of how the encoding is defined. */
#ifdef __cplusplus
}
#endif /* __cplusplus */
/* File: urltools/src/punycode.c */
/*
punycode.c from RFC 3492
http://www.nicemice.net/idn/
Adam M. Costello
http://www.nicemice.net/amc/
This is ANSI C code (C89) implementing Punycode (RFC 3492).
C. Disclaimer and license
Regarding this entire document or any portion of it (including
the pseudocode and C code), the author makes no guarantees and
is not responsible for any damage resulting from its use. The
author grants irrevocable permission to anyone to use, modify,
and distribute it in any way that does not diminish the rights
of anyone else to use, modify, and distribute it, provided that
redistributed derivative works do not contain misleading author or
version information. Derivative works need not be licensed under
similar terms.
*/
#include "punycode.h"
/**********************************************************/
/* Implementation (would normally go in its own .c file): */
#include <string.h>
/*** Bootstring parameters for Punycode ***/
enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
/* basic(cp) tests whether cp is a basic code point: */
#define basic(cp) ((punycode_uint)(cp) < 0x80)
/* delim(cp) tests whether cp is a delimiter: */
#define delim(cp) ((cp) == delimiter)
/* decode_digit(cp) returns the numeric value of a basic code */
/* point (for use in representing integers) in the range 0 to */
/* base-1, or base if cp is does not represent a value. */
static punycode_uint decode_digit(punycode_uint cp)
{
return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
cp - 97 < 26 ? cp - 97 : base;
}
/* encode_digit(d,flag) returns the basic code point whose value */
/* (when used for representing integers) is d, which needs to be in */
/* the range 0 to base-1. The lowercase form is used unless flag is */
/* nonzero, in which case the uppercase form is used. The behavior */
/* is undefined if flag is nonzero and digit d has no uppercase form. */
static char encode_digit(punycode_uint d, int flag)
{
return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
/* 0..25 map to ASCII a..z or A..Z */
/* 26..35 map to ASCII 0..9 */
}
/* flagged(bcp) tests whether a basic code point is flagged */
/* (uppercase). The behavior is undefined if bcp is not a */
/* basic code point. */
#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
/* encode_basic(bcp,flag) forces a basic code point to lowercase */
/* if flag is zero, uppercase if flag is nonzero, and returns */
/* the resulting code point. The code point is unchanged if it */
/* is caseless. The behavior is undefined if bcp is not a basic */
/* code point. */
static char encode_basic(punycode_uint bcp, int flag)
{
bcp -= (bcp - 97 < 26) << 5;
return bcp + ((!flag && (bcp - 65 < 26)) << 5);
}
/*** Platform-specific constants ***/
/* maxint is the maximum value of a punycode_uint variable: */
static const punycode_uint maxint = (punycode_uint) -1;
/* Because maxint is unsigned, -1 becomes the maximum value. */
/*** Bias adaptation function ***/
static punycode_uint adapt(
punycode_uint delta, punycode_uint numpoints, int firsttime )
{
punycode_uint k;
delta = firsttime ? delta / damp : delta >> 1;
/* delta >> 1 is a faster way of doing delta / 2 */
delta += delta / numpoints;
for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) {
delta /= base - tmin;
}
return k + (base - tmin + 1) * delta / (delta + skew);
}
/*** Main encode function ***/
enum punycode_status punycode_encode(
punycode_uint input_length,
const punycode_uint input[],
const unsigned char case_flags[],
punycode_uint *output_length,
char output[] )
{
punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
/* Initialize the state: */
n = initial_n;
delta = out = 0;
max_out = *output_length;
bias = initial_bias;
/* Handle the basic code points: */
for (j = 0; j < input_length; ++j) {
if (basic(input[j])) {
if (max_out - out < 2) return punycode_big_output;
output[out++] =
case_flags ? encode_basic(input[j], case_flags[j]) : (char)input[j];
}
/* else if (input[j] < n) return punycode_bad_input; */
/* (not needed for Punycode with unsigned code points) */
}
h = b = out;
/* h is the number of code points that have been handled, b is the */
/* number of basic code points, and out is the number of characters */
/* that have been output. */
if (b > 0) output[out++] = delimiter;
/* Main encoding loop: */
while (h < input_length) {
/* All non-basic code points < n have been */
/* handled already. Find the next larger one: */
for (m = maxint, j = 0; j < input_length; ++j) {
/* if (basic(input[j])) continue; */
/* (not needed for Punycode) */
if (input[j] >= n && input[j] < m) m = input[j];
}
/* Increase delta enough to advance the decoder's */
/* state to <n,0>, but guard against overflow: */
if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
delta += (m - n) * (h + 1);
n = m;
for (j = 0; j < input_length; ++j) {
/* Punycode does not need to check whether input[j] is basic: */
if (input[j] < n /* || basic(input[j]) */ ) {
if (++delta == 0) return punycode_overflow;
}
if (input[j] == n) {
/* Represent delta as a generalized variable-length integer: */
for (q = delta, k = base; ; k += base) {
if (out >= max_out) return punycode_big_output;
t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
k >= bias + tmax ? tmax : k - bias;
if (q < t) break;
output[out++] = encode_digit(t + (q - t) % (base - t), 0);
q = (q - t) / (base - t);
}
output[out++] = encode_digit(q, case_flags && case_flags[j]);
bias = adapt(delta, h + 1, h == b);
delta = 0;
++h;
}
}
++delta, ++n;
}
*output_length = out;
return punycode_success;
}
/*** Main decode function ***/
enum punycode_status punycode_decode(
punycode_uint input_length,
const char input[],
punycode_uint *output_length,
punycode_uint output[],
unsigned char case_flags[] )
{
punycode_uint n, out, i, max_out, bias,
b, j, in, oldi, w, k, digit, t;
if (!input_length) {
return punycode_bad_input;
}
/* Initialize the state: */
n = initial_n;
out = i = 0;
max_out = *output_length;
bias = initial_bias;
/* Handle the basic code points: Let b be the number of input code */
/* points before the last delimiter, or 0 if there is none, then */
/* copy the first b code points to the output. */
for (b = 0, j = input_length - 1 ; j > 0; --j) {
if (delim(input[j])) {
b = j;
break;
}
}
if (b > max_out) return punycode_big_output;
for (j = 0; j < b; ++j) {
if (case_flags) case_flags[out] = flagged(input[j]);
if (!basic(input[j])) return punycode_bad_input;
output[out++] = input[j];
}
/* Main decoding loop: Start just after the last delimiter if any */
/* basic code points were copied; start at the beginning otherwise. */
for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) {
/* in is the index of the next character to be consumed, and */
/* out is the number of code points in the output array. */
/* Decode a generalized variable-length integer into delta, */
/* which gets added to i. The overflow checking is easier */
/* if we increase i as we go, then subtract off its starting */
/* value at the end to obtain delta. */
for (oldi = i, w = 1, k = base; ; k += base) {
if (in >= input_length) return punycode_bad_input;
digit = decode_digit(input[in++]);
if (digit >= base) return punycode_bad_input;
if (digit > (maxint - i) / w) return punycode_overflow;
i += digit * w;
t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
k >= bias + tmax ? tmax : k - bias;
if (digit < t) break;
if (w > maxint / (base - t)) return punycode_overflow;
w *= (base - t);
}
bias = adapt(i - oldi, out + 1, oldi == 0);
/* i was supposed to wrap around from out+1 to 0, */
/* incrementing n each time, so we'll fix that now: */
if (i / (out + 1) > maxint - n) return punycode_overflow;
n += i / (out + 1);
i %= (out + 1);
/* Insert n at position i of the output: */
/* not needed for Punycode: */
/* if (decode_digit(n) <= base) return punycode_invalid_input; */
if (out >= max_out) return punycode_big_output;
if (case_flags) {
memmove(case_flags + i + 1, case_flags + i, out - i);
/* Case of last character determines uppercase flag: */
case_flags[i] = flagged(input[in - 1]);
}
memmove(output + i + 1, output + i, (out - i) * sizeof *output);
output[i++] = n;
}
*output_length = out;
return punycode_success;
}
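/* Editor's sketch (not part of the original file): the inverse call, decoding
"bcher-kva" back into code points. Illustration only. */
#if 0
const char *in = "bcher-kva";
punycode_uint out[64];
punycode_uint out_len = 64;
enum punycode_status st = punycode_decode(9, in, &out_len, out, NULL);
/* On success, out_len == 6 and out[1] == 0xFC, i.e. the "ü" of "bücher". */
#endif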
urltools/src/Makevars 0000644 0001762 0000144 00000000030 13230557631 014416 0 ustar ligges users PKG_CPPFLAGS = -UNDEBUG
urltools/src/puny.cpp 0000644 0001762 0000144 00000013446 13230557631 014440 0 ustar ligges users #include <Rcpp.h>
#include "punycode.h"
extern "C"{
#include "utf8.h"
}
using namespace Rcpp;
#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>
#define BUFLENT 2048
static char buf[BUFLENT];
static uint32_t ibuf[BUFLENT];
static std::string ascii = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890_.?&=:/";
static inline void clearbuf(){
for (int i=0; i < BUFLENT; i++){
buf[i] = '\0';
ibuf[i] = 0;
}
}
struct url {
std::vector < std::string > split_url;
std::string protocol;
std::string path;
};
void split_url(std::string x, url& output){
size_t last;
size_t loc = x.find(".");
last = x.find("://");
if(last != std::string::npos){
output.protocol = x.substr(0, (last + 3));
x = x.substr(last + 3);
}
last = x.find_first_of(":/");
if(last != std::string::npos){
output.path = x.substr(last);
x = x.substr(0, last);
}
last = 0;
loc = x.find(".");
while (loc != std::string::npos) {
output.split_url.push_back(x.substr(last, loc-last));
last = ++loc;
loc = x.find(".", loc);
}
if (loc == std::string::npos){
output.split_url.push_back(x.substr(last, x.length()));
}
}
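// Editor's note - behaviour sketch, derived from the logic above: for
// x = "https://www.bücher.com/foo", split_url() sets output.protocol to
// "https://", output.path to "/foo", and output.split_url to the labels
// {"www", "bücher", "com"}.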
std::string check_result(enum punycode_status& st, std::string& x){
std::string ret = "Error with the URL " + x + ":";
if (st == punycode_bad_input){
ret += "input is invalid";
} else if (st == punycode_big_output){
ret += "output would exceed the space provided";
} else if (st == punycode_overflow){
ret += "input needs wider integers to process";
} else {
return "";
}
return ret;
}
String encode_single(std::string x){
url holding;
split_url(x, holding);
std::string output = holding.protocol;
for(unsigned int i = 0; i < holding.split_url.size(); i++){
// Check if it's an ASCII-only fragment - if so, nowt to do here.
if(holding.split_url[i].find_first_not_of(ascii) == std::string::npos){
output += holding.split_url[i];
if(i < (holding.split_url.size() - 1)){
output += ".";
}
} else {
// Prep for conversion
punycode_uint buflen = BUFLENT;
punycode_uint unilen = BUFLENT;
const char *s = holding.split_url[i].c_str();
const int slen = strlen(s);
// Do the conversion
unilen = u8_toucs(ibuf, unilen, s, slen);
enum punycode_status st = punycode_encode(unilen, ibuf, NULL, &buflen, buf);
// Check it worked
std::string ret = check_result(st, x);
if(ret.size()){
Rcpp::warning(ret);
return NA_STRING;
}
std::string encoded = Rcpp::as<std::string>(Rf_mkCharLenCE(buf, buflen, CE_UTF8));
if(encoded != holding.split_url[i]){
encoded = "xn--" + encoded;
}
output += encoded;
if(i < (holding.split_url.size() - 1)){
output += ".";
}
}
}
output += holding.path;
return output;
}
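// Editor's note - behaviour sketch: encode_single("https://www.bücher.com/foo")
// passes "www" and "com" through untouched (pure ASCII), punycode-encodes
// "bücher" to "bcher-kva", adds the "xn--" prefix, and returns
// "https://www.xn--bcher-kva.com/foo".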
//'@title Encode or Decode Internationalised Domains
//'@description \code{puny_encode} and \code{puny_decode} implement
//'the encoding standard for internationalised (non-ASCII) domains and
//'subdomains. You can use them to encode UTF-8 domain names, or decode
//'encoded names (which start with "xn--"), or both.
//'
//'@param x a vector of URLs. These should be URL decoded using \code{\link{url_decode}}.
//'
//'@return a CharacterVector containing encoded or decoded versions of the entries in \code{x}.
//'Invalid URLs (ones that are \code{NA}, or ones that do not successfully map to an actual
//'decoded or encoded version) will be returned as \code{NA}.
//'
//'@examples
//'# Encode a URL
//'puny_encode("https://www.bücher.com/foo")
//'
//'# Decode the result, back to the original
//'puny_decode("https://www.xn--bcher-kva.com/foo")
//'
//'@seealso \code{\link{url_decode}} and \code{\link{url_encode}} for percent-encoding.
//'
//'@rdname puny
//'@export
//[[Rcpp::export]]
CharacterVector puny_encode(CharacterVector x){
unsigned int input_size = x.size();
CharacterVector output(input_size);
for(unsigned int i = 0; i < input_size; i++){
if(i % 10000 == 0){
Rcpp::checkUserInterrupt();
}
if(x[i] == NA_STRING){
output[i] = NA_STRING;
} else {
output[i] = encode_single(Rcpp::as<std::string>(x[i]));
}
}
clearbuf();
return output;
}
String decode_single(std::string x){
url holding;
split_url(x, holding);
String output(holding.protocol, CE_UTF8);
for(unsigned int i = 0; i < holding.split_url.size(); i++){
// Check if it's an ASCII-only fragment - if so, nowt to do here.
if(holding.split_url[i].size() < 4 || holding.split_url[i].substr(0,4) != "xn--"){
output += holding.split_url[i];
if(i < (holding.split_url.size() - 1)){
output += ".";
}
} else {
// Prep for conversion
punycode_uint unilen = BUFLENT;
std::string tmp = holding.split_url[i].substr(4);
const char *s = tmp.c_str();
const int slen = strlen(s);
// Do the conversion
enum punycode_status st = punycode_decode(slen, s, &unilen, ibuf, NULL);
// Check it worked
std::string ret = check_result(st, x);
if(ret.size()){
Rcpp::warning(ret);
return NA_STRING;
}
u8_toutf8(buf, BUFLENT, ibuf, unilen);
output += buf;
if(i < (holding.split_url.size() - 1)){
output += ".";
}
}
}
output += holding.path;
return output;
}
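// Editor's note - behaviour sketch: decode_single("https://www.xn--bcher-kva.com/foo")
// strips the "xn--" prefix from the middle label, punycode-decodes "bcher-kva"
// back to UTF-8, and reassembles "https://www.bücher.com/foo".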
//'@rdname puny
//'@export
//[[Rcpp::export]]
CharacterVector puny_decode(CharacterVector x){
unsigned int input_size = x.size();
CharacterVector output(input_size);
for(unsigned int i = 0; i < input_size; i++){
if(i % 10000 == 0){
Rcpp::checkUserInterrupt();
}
if(x[i] == NA_STRING){
output[i] = NA_STRING;
} else {
output[i] = decode_single(Rcpp::as<std::string>(x[i]));
}
}
return output;
}
urltools/src/utf8.c 0000644 0001762 0000144 00000011417 13230557631 013767 0 ustar ligges users /*
Basic UTF-8 manipulation routines
by Jeff Bezanson
placed in the public domain Fall 2005
This code is designed to provide the utilities you need to manipulate
UTF-8 as an internal string encoding. These functions do not perform the
error checking normally needed when handling UTF-8 data, so if you happen
to be from the Unicode Consortium you will want to flay me alive.
I do this because error checking can be performed at the boundaries (I/O),
with these routines reserved for higher performance on data known to be
valid.
A UTF-8 validation routine is included.
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdarg.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>
#ifdef WIN32
#include <malloc.h>
#define snprintf _snprintf
#else
#ifndef __FreeBSD__
#include <alloca.h>
#endif /* __FreeBSD__ */
#endif
#include <assert.h>
#include "utf8.h"
static const uint32_t offsetsFromUTF8[6] = {
0x00000000UL, 0x00003080UL, 0x000E2080UL,
0x03C82080UL, 0xFA082080UL, 0x82082080UL
};
static const char trailingBytesForUTF8[256] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
/* returns length of next utf-8 sequence */
size_t u8_seqlen(const char *s)
{
return trailingBytesForUTF8[(unsigned int)(unsigned char)s[0]] + 1;
}
/* returns the # of bytes needed to encode a certain character
0 means the character cannot (or should not) be encoded. */
size_t u8_charlen(uint32_t ch)
{
if (ch < 0x80)
return 1;
else if (ch < 0x800)
return 2;
else if (ch < 0x10000)
return 3;
else if (ch < 0x110000)
return 4;
return 0;
}
size_t u8_codingsize(uint32_t *wcstr, size_t n)
{
size_t i, c=0;
for(i=0; i < n; i++)
c += u8_charlen(wcstr[i]);
return c;
}
/* conversions without error checking
only works for valid UTF-8, i.e. no 5- or 6-byte sequences
srcsz = source size in bytes
sz = dest size in # of wide characters
returns # characters converted
if sz == srcsz+1 (i.e. 4*srcsz+4 bytes), there will always be enough space.
*/
size_t u8_toucs(uint32_t *dest, size_t sz, const char *src, size_t srcsz)
{
uint32_t ch;
const char *src_end = src + srcsz;
size_t nb;
size_t i=0;
if (sz == 0 || srcsz == 0)
return 0;
while (i < sz) {
if (!isutf(*src)) { // invalid sequence
dest[i++] = 0xFFFD;
src++;
if (src >= src_end) break;
continue;
}
nb = trailingBytesForUTF8[(unsigned char)*src];
if (src + nb >= src_end)
break;
ch = 0;
switch (nb) {
/* these fall through deliberately */
case 5: ch += (unsigned char)*src++; ch <<= 6;
case 4: ch += (unsigned char)*src++; ch <<= 6;
case 3: ch += (unsigned char)*src++; ch <<= 6;
case 2: ch += (unsigned char)*src++; ch <<= 6;
case 1: ch += (unsigned char)*src++; ch <<= 6;
case 0: ch += (unsigned char)*src++;
}
ch -= offsetsFromUTF8[nb];
dest[i++] = ch;
}
return i;
}
/* srcsz = number of source characters
sz = size of dest buffer in bytes
returns # bytes stored in dest
the destination string will never be bigger than the source string.
*/
size_t u8_toutf8(char *dest, size_t sz, const uint32_t *src, size_t srcsz)
{
uint32_t ch;
size_t i = 0;
char *dest0 = dest;
char *dest_end = dest + sz;
while (i < srcsz) {
ch = src[i];
if (ch < 0x80) {
if (dest >= dest_end)
break;
*dest++ = (char)ch;
}
else if (ch < 0x800) {
if (dest >= dest_end-1)
break;
*dest++ = (ch>>6) | 0xC0;
*dest++ = (ch & 0x3F) | 0x80;
}
else if (ch < 0x10000) {
if (dest >= dest_end-2)
break;
*dest++ = (ch>>12) | 0xE0;
*dest++ = ((ch>>6) & 0x3F) | 0x80;
*dest++ = (ch & 0x3F) | 0x80;
}
else if (ch < 0x110000) {
if (dest >= dest_end-3)
break;
*dest++ = (ch>>18) | 0xF0;
*dest++ = ((ch>>12) & 0x3F) | 0x80;
*dest++ = ((ch>>6) & 0x3F) | 0x80;
*dest++ = (ch & 0x3F) | 0x80;
}
i++;
}
return (dest-dest0);
}
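/* Editor's sketch (not part of the original file): a hedged round trip through
the two converters above, assuming only their declarations. Illustration only. */
#if 0
const char *s = "b\xC3\xBC";              /* "bü" as UTF-8 bytes */
uint32_t wide[8];
char back[8];
size_t n = u8_toucs(wide, 8, s, 3);       /* n == 2; wide = {0x62, 0xFC} */
size_t m = u8_toutf8(back, 8, wide, n);   /* m == 3; back holds b, 0xC3, 0xBC */
#endif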
urltools/src/compose.h 0000644 0001762 0000144 00000002445 13230557631 014554 0 ustar ligges users #include <Rcpp.h>
using namespace Rcpp;
#ifndef __COMPOSE_INCLUDED__
#define __COMPOSE_INCLUDED__
/**
* A namespace for recomposing parsed URLs
*/
namespace compose {
/**
* A function for briefly checking if a component is empty before doing anything
* with it
*
* @param element an Rcpp String to check
*
* @return true if the string is not empty, false if it is.
*/
bool emptycheck(String element);
/**
* A function for recomposing a single URL
*
* @param scheme the scheme of the URL
*
* @param domain the domain of the URL
*
* @param port the port of the URL
*
* @param path the path of the URL
*
* @param parameter the parameter of the URL
*
* @param fragment the fragment of the URL
*
* @return an Rcpp String containing the recomposed URL
*
* @seealso compose_multiple for the vectorised version
*/
std::string compose_single(String scheme, String domain, String port, String path,
String parameter, String fragment);
/**
* A function for recomposing a vector of URLs
*
* @param parsed_urls a DataFrame provided by url_parse
*
* @return a CharacterVector containing the recomposed URLs
*/
CharacterVector compose_multiple(DataFrame parsed_urls);
}
#endif
urltools/src/parsing.h 0000644 0001762 0000144 00000007533 13230557631 014553 0 ustar ligges users #include <Rcpp.h>
using namespace Rcpp;
#ifndef __PARSING_INCLUDED__
#define __PARSING_INCLUDED__
namespace parsing {
/**
* A function for parsing a URL and turning it into a vector.
* Tremendously useful (read: everything breaks without this)
*
* @param url a URL.
*
* @see get_ and set_component, which call this.
*
* @return a vector consisting of the value for each component
* part of the URL.
*/
CharacterVector url_to_vector(std::string url);
/**
* A function for lower-casing an entire string
*
* @param str a string to lower-case
*
* @return a string containing the lower-cased version of the
* input.
*/
std::string string_tolower(std::string str);
/**
* A function for extracting the scheme of a URL; part of the
* URL parsing framework.
*
* @param url a reference to a url.
*
* @see url_to_vector which calls this.
*
* @return a string containing the scheme of the URL if identifiable,
* and "" if not.
*/
std::string scheme(std::string& url);
/**
* A function for extracting the domain and port of a URL; part of the
* URL parsing framework. Fairly unique in that it outputs a
* vector, unlike the rest of the framework, which outputs a string,
* since it has to handle multiple elements.
*
* @param url a reference to a url. Should've been run through
* scheme() first.
*
* @see url_to_vector which calls this.
*
* @return a vector containing the domain and port of the URL if identifiable,
* and "" for each non-identifiable element.
*/
std::vector < std::string > domain_and_port(std::string& url);
/**
* A function for extracting the path of a URL; part of the
* URL parsing framework.
*
* @param url a reference to a url. Should've been run through
* scheme() and domain_and_port() first.
*
* @see url_to_vector which calls this.
*
* @return a string containing the path of the URL if identifiable,
* and "" if not.
*/
std::string path(std::string& url);
/**
* A function for extracting the query string of a URL; part of the
* URL parsing framework.
*
* @param url a reference to a url. Should've been run through
* scheme(), domain_and_port() and path() first.
*
* @see url_to_vector which calls this.
*
* @return a string containing the query string of the URL if identifiable,
* and "" if not.
*/
std::string query(std::string& url);
String check_parse_out(std::string x);
/**
* A function to retrieve an individual component from a parsed
* URL. Used in scheme(), host() et al; calls parse_url.
*
* @param url a URL.
*
* @param component an integer representing which value in
* parse_url's returned vector to grab.
*
* @see set_component, which allows for modification.
*
* @return a string consisting of the requested URL component.
*/
String get_component(std::string url, int component);
/**
* A function to set an individual component in a parsed
* URL. Used in "scheme<-", et al; calls parse_url.
*
* @param url a URL.
*
* @param component an integer representing which value in
* parse_url's returned vector to modify.
*
* @param new_value the value to insert into url[component].
*
* @param rm whether the intent is to remove the component
* (in which case new_value must be an NA_STRING)
*
* @see get_component, which allows for retrieval.
*
* @return a string consisting of the modified URL.
*/
String set_component(std::string url, int component, String new_value,
bool rm);
/**
* Decompose a vector of URLs and turn it into a data.frame.
*
* @param urls_ptr a reference to a vector of URLs
*
* @return an Rcpp data.frame.
*
*/
DataFrame parse_to_df(CharacterVector& urls_ptr);
}
#endif
urltools/src/parameter.h 0000644 0001762 0000144 00000005526 13230557631 015072 0 ustar ligges users #include "parsing.h"
#ifndef __PARAM_INCLUDED__
#define __PARAM_INCLUDED__
namespace parameter {
/**
* Split out a URL query from the actual body. Used
* in set_ and remove_parameter.
*
* @param url a URL.
*
* @return a deque either of length 1, indicating that no
* query was found, or 2, indicating that one was.
*/
std::deque < std::string > get_query_string(std::string url);
/**
* Set the value of a single key=value parameter.
*
* @param url a URL.
*
* @param component a reference to the key to set
*
* @param value a reference to the value to set.
*
* @return a string containing URL + key=value, controlling
* for the possibility that the URL did not previously have a query
* associated - or did, and /had that key/, but was associating a
* different value with it.
*/
std::string set_parameter(std::string url, std::string& component, std::string value);
String get_parameter_single(std::string url, std::string& component);
/**
* Remove a range of key/value parameters
*
* @param url a URL.
*
* @param params a vector of keys.
*
* @return a string containing the URL but absent the keys and values that were specified.
*
*/
std::string remove_parameter_single(std::string url, CharacterVector params);
std::deque< std::string > get_parameter_names_single(std::string url);
/**
* Component retrieval specifically for parameters.
*
* @param urls a reference to a vector of URLs
*
* @param component the name of a component to retrieve
* the value of
*
* @return a vector of the values for that component.
*/
CharacterVector get_parameter(CharacterVector& urls, std::string component);
/**
* Scan a list of URLS for parameter names used.
*
* @param urls a reference to a character vector of URLs
*
* @return a vector of unique parameter names.
*/
CharacterVector get_parameter_names(CharacterVector &urls);
/**
* Set the value of a single key=value parameter for a vector of strings.
*
* @param urls a vector of URLs.
*
* @param component a string containing the key to set
*
* @param value a vector of values to set.
*
* @return the initial URLs vector, with the aforementioned string modifications.
*/
CharacterVector set_parameter_vectorised(CharacterVector urls, String component,
CharacterVector value);
/**
* Remove a range of key/value parameters from a vector of strings.
*
* @param urls a vector of URLs.
*
* @param params a vector of keys.
*
* @return the initial URLs vector, with the aforementioned string modifications.
*
*/
CharacterVector remove_parameter_vectorised(CharacterVector urls,
CharacterVector params);
}
#endif
urltools/src/parsing.cpp 0000644 0001762 0000144 00000034741 13230557631 015111 0 ustar ligges users #include "parsing.h"
std::string parsing::string_tolower(std::string str){
unsigned int input_size = str.size();
for(unsigned int i = 0; i < input_size; i++){
str[i] = tolower(str[i]);
}
return str;
}
std::string parsing::scheme(std::string& url){
std::string output;
std::size_t protocol = url.find("://");
std::size_t definite_end = url.find(".");
if((protocol == std::string::npos) || protocol > definite_end){
//If that's not present, or isn't present at the /beginning/, unknown
output = "";
} else {
output = url.substr(0,protocol);
url = url.substr((protocol+3));
}
return output;
}
std::vector < std::string > parsing::domain_and_port(std::string& url){
std::vector < std::string > output(2);
std::string holding;
unsigned int output_offset = 0;
// Check for the presence of user authentication info. If it exists, dump it.
// Use a query-check here because some people put @ info in params, baaah
std::size_t f_param = url.find("?");
std::size_t auth;
if(f_param != std::string::npos){
auth = url.substr(0, f_param).find("@");
} else {
auth = url.find("@");
}
if(auth != std::string::npos){
url = url.substr(auth+1);
}
// ID IPv6(?)
if(url.size() && url[0] == '['){
std::size_t ipv6_end = url.find("]");
if(ipv6_end != std::string::npos){
output[0] = url.substr(1,(ipv6_end-1));
if(ipv6_end == url.size()-1){
url = "";
return output;
}
url = url.substr(ipv6_end+1);
}
}
// Identify the port. If there is one, push everything
// before that straight into the output, and the remainder
// into the holding string. If not, the entire
// url goes into the holding string.
std::size_t port = url.find(":");
if(port != std::string::npos && url.find("/") >= port){
output[0] += url.substr(0,port);
holding = url.substr(port+1);
output_offset++;
} else {
holding = url;
}
// Look for the first slash, which ends the domain (and port) section
std::size_t trailing_slash = holding.find("/");
// If there is one, that's when everything ends
if(trailing_slash != std::string::npos){
output[output_offset] = holding.substr(0, trailing_slash);
output_offset++;
url = holding.substr(trailing_slash+1);
return output;
}
// If not, there might be a query parameter or fragment
// associated
// with the base URL, which we need to preserve.
std::size_t param = holding.find("?");
// If there is, handle that
if(param != std::string::npos){
output[output_offset] = holding.substr(0, param);
url = holding.substr(param);
return output;
} else {
std::size_t frag = holding.find("#");
if(frag != std::string::npos){
output[output_offset] = holding.substr(0, frag);
url = holding.substr(frag);
return output;
}
}
// Otherwise we're done here
output[output_offset] = holding;
url = "";
return output;
}
std::string parsing::path(std::string& url){
if(url.size() == 0){
return url;
}
std::string output;
std::size_t path = url.find("?");
if(path == std::string::npos){
std::size_t fragment = url.find("#");
if(fragment == std::string::npos){
output = url;
url = "";
return output;
}
output = url.substr(0,fragment);
url = url.substr(fragment);
return output;
}
output = url.substr(0,path);
url = url.substr(path+1);
return output;
}
std::string parsing::query(std::string& url){
if(url == ""){
return url;
}
std::string output;
std::size_t fragment = url.find("#");
if(fragment == std::string::npos){
output = url;
url = "";
return output;
}
output = url.substr(0,fragment);
url = url.substr(fragment+1);
return output;
}
String parsing::check_parse_out(std::string x){
if(x == ""){
return NA_STRING;
}
return x;
}
//URL parser
CharacterVector parsing::url_to_vector(std::string url){
std::string &url_ptr = url;
//Output object, holding object, normalise.
CharacterVector output(6);
std::vector < std::string > holding(2);
std::string s = scheme(url_ptr);
holding = domain_and_port(url_ptr);
//Run
output[0] = check_parse_out(string_tolower(s));
output[1] = check_parse_out(string_tolower(holding[0]));
output[2] = check_parse_out(holding[1]);
output[3] = check_parse_out(path(url_ptr));
output[4] = check_parse_out(query(url_ptr));
output[5] = check_parse_out(url_ptr);
return output;
}
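// Editor's note - worked example, following the pipeline above:
// url_to_vector("HTTP://Example.com:80/a/b?x=1#frag") yields, in order,
// "http", "example.com", "80", "a/b", "x=1", "frag" - the scheme and
// domain lower-cased, and each component stripped of its delimiter.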
//Component retrieval
String parsing::get_component(std::string url, int component){
return url_to_vector(url)[component];
}
//Component modification
String parsing::set_component(std::string url, int component, String new_value,
bool rm){
if(new_value == NA_STRING && !rm){
return NA_STRING;
}
std::string output;
CharacterVector parsed_url = url_to_vector(url);
parsed_url[component] = new_value;
if(parsed_url[0] != NA_STRING){
output += parsed_url[0];
output += "://";
}
if(parsed_url[1] != NA_STRING){
output += parsed_url[1];
}
if(parsed_url[2] != NA_STRING){
output += ":";
output += parsed_url[2];
}
if(parsed_url[3] != NA_STRING){
output += "/";
output += parsed_url[3];
}
if(parsed_url[4] != NA_STRING){
output += "?";
output += parsed_url[4];
}
if(parsed_url[5] != NA_STRING){
output += "#";
output += parsed_url[5];
}
return output;
}
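// Editor's note - worked example: set_component("https://example.com/a?x=1", 4, "y=2", false)
// re-parses the URL, overwrites the query (index 4), and rebuilds it as
// "https://example.com/a?y=2".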
DataFrame parsing::parse_to_df(CharacterVector& urls_ptr){
//Input and holding objects
unsigned int input_size = urls_ptr.size();
CharacterVector holding(6);
//Output objects
CharacterVector schemes(input_size);
CharacterVector domains(input_size);
CharacterVector ports(input_size);
CharacterVector paths(input_size);
CharacterVector parameters(input_size);
CharacterVector fragments(input_size);
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
// Handle NAs on input
if(urls_ptr[i] == NA_STRING){
schemes[i] = NA_STRING;
domains[i] = NA_STRING;
ports[i] = NA_STRING;
paths[i] = NA_STRING;
parameters[i] = NA_STRING;
fragments[i] = NA_STRING;
} else {
holding = url_to_vector(Rcpp::as<std::string>(urls_ptr[i]));
schemes[i] = holding[0];
domains[i] = holding[1];
ports[i] = holding[2];
paths[i] = holding[3];
parameters[i] = holding[4];
fragments[i] = holding[5];
}
}
return DataFrame::create(_["scheme"] = schemes,
_["domain"] = domains,
_["port"] = ports,
_["path"] = paths,
_["parameter"] = parameters,
_["fragment"] = fragments,
_["stringsAsFactors"] = false);
}
//'@title split URLs into their component parts
//'@description \code{url_parse} takes a vector of URLs and splits each one into its component
//'parts, as recognised by RfC 3986.
//'
//'@param urls a vector of URLs
//'
//'@details It's useful to be able to take a URL and split it out into its component parts -
//'for the purpose of hostname extraction, for example, or analysing API calls. This functionality
//'is not provided in base R, although it is provided in \code{\link[httr]{parse_url}}; that
//'implementation is entirely in R, uses regular expressions, and is not vectorised. It's
//'perfectly suitable for the intended purpose (decomposition in the context of automated
//'HTTP requests from R), but not for large-scale analysis.
//'
//'Note that user authentication/identification information is not extracted;
//'this can be found with \code{\link{get_credentials}}.
//'
//'@return a data.frame consisting of the columns scheme, domain, port, path, query
//'and fragment. See the \href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} for
//'definitions. If an element cannot be identified, it is represented by \code{NA}.
//'
//'@examples
//'url_parse("https://en.wikipedia.org/wiki/Article")
//'
//'@seealso \code{\link{param_get}} for extracting values associated with particular keys in a URL's
//'query string, and \code{\link{url_compose}}, which is \code{url_parse} in reverse.
//'
//'@export
//[[Rcpp::export]]
DataFrame url_parse(CharacterVector urls){
CharacterVector& urls_ptr = urls;
return parsing::parse_to_df(urls_ptr);
}
//[[Rcpp::export]]
CharacterVector get_component_(CharacterVector urls, int component){
unsigned int input_size = urls.size();
CharacterVector output(input_size);
for (unsigned int i = 0; i < input_size; ++i){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] != NA_STRING){
output[i] = parsing::get_component(Rcpp::as<std::string>(urls[i]), component);
} else {
output[i] = NA_STRING;
}
}
return output;
}
//[[Rcpp::export]]
CharacterVector set_component_(CharacterVector urls, int component,
CharacterVector new_value){
unsigned int input_size = urls.size();
CharacterVector output(input_size);
if(new_value.size() == 1){
for (unsigned int i = 0; i < input_size; ++i){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
output[i] = parsing::set_component(Rcpp::as<std::string>(urls[i]), component, new_value[0], false);
}
} else if(new_value.size() == input_size){
for (unsigned int i = 0; i < input_size; ++i){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
output[i] = parsing::set_component(Rcpp::as<std::string>(urls[i]), component, new_value[i], false);
}
} else {
Rcpp::stop("The number of new values must either be 1, or match the number of URLs");
}
return output;
}
//[[Rcpp::export]]
CharacterVector set_component_r(CharacterVector urls, int component,
CharacterVector new_value,
std::string comparator){
// Output object
unsigned int input_size = urls.size();
CharacterVector output(input_size);
// Comparator checking objects
std::string holding;
String to_use;
unsigned int holding_size;
unsigned int comparator_length = comparator.size();
// If we've got a single value, apply it to every URL
if(new_value.size() == 1){
if(new_value[0] == NA_STRING){
to_use = new_value[0];
} else {
holding = new_value[0];
holding_size = holding.size();
if(holding_size < comparator_length){
to_use = holding;
} else {
if(holding.substr((holding_size - comparator_length), comparator_length) == comparator){
to_use = holding.substr(0, (holding_size - comparator_length));
} else {
to_use = holding;
}
}
}
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
output[i] = parsing::set_component(Rcpp::as<std::string>(urls[i]), component, to_use, false);
}
// If we've got multiple values, it's just a rejigging of the same
} else if(new_value.size() == input_size){
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(new_value[i] == NA_STRING){
to_use = new_value[i];
} else {
holding = new_value[i];
holding_size = holding.size();
if(holding_size < comparator_length){
to_use = holding;
} else {
if(holding.substr((holding_size - comparator_length), comparator_length) == comparator){
to_use = holding.substr(0, (holding_size - comparator_length));
} else {
to_use = holding;
}
}
}
output[i] = parsing::set_component(Rcpp::as<std::string>(urls[i]), component, to_use, false);
}
} else {
Rcpp::stop("The number of new values must either be 1, or match the number of URLs");
}
return output;
}
//[[Rcpp::export]]
CharacterVector set_component_f(CharacterVector urls, int component,
CharacterVector new_value,
std::string comparator){
// Output object
unsigned int input_size = urls.size();
CharacterVector output(input_size);
// Comparator checking objects
std::string holding;
String to_use;
unsigned int holding_size;
unsigned int comparator_length = comparator.size();
// If we've got a single value, apply it to every URL
if(new_value.size() == 1){
if(new_value[0] == NA_STRING){
to_use = new_value[0];
} else {
holding = new_value[0];
holding_size = holding.size();
if(holding_size < comparator_length){
to_use = holding;
} else {
if(holding.substr(0, comparator_length) == comparator){
to_use = holding.substr(comparator_length, (holding_size - comparator_length));
} else {
to_use = holding;
}
}
}
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
output[i] = parsing::set_component(Rcpp::as<std::string>(urls[i]), component, to_use, false);
}
// If we've got multiple values, it's just a rejigging of the same
} else if(new_value.size() == input_size){
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(new_value[i] == NA_STRING){
to_use = new_value[i];
} else {
holding = new_value[i];
holding_size = holding.size();
if(holding_size < comparator_length){
to_use = holding;
} else {
if(holding.substr(0, comparator_length) == comparator){
to_use = holding.substr(comparator_length, (holding_size - comparator_length));
} else {
to_use = holding;
}
}
}
output[i] = parsing::set_component(Rcpp::as<std::string>(urls[i]), component, to_use, false);
}
} else {
Rcpp::stop("The number of new values must either be 1, or match the number of URLs");
}
return output;
}
//[[Rcpp::export]]
CharacterVector rm_component_(CharacterVector urls, int component){
if(component < 2){
Rcpp::stop("Scheme and domain are required components");
}
unsigned int input_size = urls.size();
CharacterVector output(input_size);
for (unsigned int i = 0; i < input_size; ++i){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
output[i] = parsing::set_component(Rcpp::as<std::string>(urls[i]), component, NA_STRING, true);
}
return output;
}
urltools/src/credentials.cpp 0000644 0001762 0000144 00000005543 13230557631 015741 0 ustar ligges users #include <Rcpp.h>
using namespace Rcpp;
std::string strip_single(std::string x){
std::size_t scheme_loc = x.find("://");
if(scheme_loc == std::string::npos){
return x;
}
std::size_t cred_loc = x.find("@");
if(cred_loc == std::string::npos){
return x;
}
if(scheme_loc > cred_loc){
return x;
}
return x.substr(0, scheme_loc+3) + x.substr(cred_loc+1);
}
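// Editor's note - behaviour sketch: strip_single("http://foo:bar@97.77.104.22:3128")
// finds "://" before "@", drops the credentials, and returns
// "http://97.77.104.22:3128"; strings with no scheme or no "@" pass through.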
//'@title Get or remove user authentication credentials
//'@description authentication credentials appear before the domain
//'name and look like \emph{user:password}. Sometimes you want them removed,
//'or retrieved; \code{strip_credentials} and \code{get_credentials} do
//'precisely that.
//'
//'@aliases creds
//'@rdname creds
//'
//'@param urls a URL, or vector of URLs
//'
//'@examples
//'# Remove credentials
//'strip_credentials("http://foo:bar@97.77.104.22:3128")
//'
//'# Get credentials
//'get_credentials("http://foo:bar@97.77.104.22:3128")
//'@export
//[[Rcpp::export]]
CharacterVector strip_credentials(CharacterVector urls){
std::string holding;
unsigned int input_size = urls.size();
CharacterVector output(input_size);
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] == NA_STRING){
output[i] = NA_STRING;
} else {
output[i] = strip_single(Rcpp::as<std::string>(urls[i]));
}
}
return output;
}
void get_single(std::string x, CharacterVector& username, CharacterVector& data, unsigned int& i){
std::size_t scheme_loc = x.find("://");
if(scheme_loc == std::string::npos){
username[i] = NA_STRING;
data[i] = NA_STRING;
return;
}
std::size_t cred_loc = x.find("@");
if(cred_loc == std::string::npos){
username[i] = NA_STRING;
data[i] = NA_STRING;
return;
}
if(scheme_loc > cred_loc){
username[i] = NA_STRING;
data[i] = NA_STRING;
return;
}
std::string holding = x.substr(scheme_loc+3, (cred_loc - (scheme_loc+3)));
std::size_t info = holding.find(":");
if(info == std::string::npos){
username[i] = holding;
data[i] = NA_STRING;
return;
} else {
username[i] = holding.substr(0, info);
data[i] = holding.substr(info+1);
return;
}
}
//'@rdname creds
//'@export
//[[Rcpp::export]]
DataFrame get_credentials(CharacterVector urls){
unsigned int input_size = urls.size();
CharacterVector user(input_size);
CharacterVector data(input_size);
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] == NA_STRING){
user[i] = NA_STRING;
data[i] = NA_STRING;
} else {
get_single(Rcpp::as<std::string>(urls[i]), user, data, i);
}
}
return DataFrame::create(_["username"] = user,
_["authentication"] = data,
_["stringsAsFactors"] = false);
}
urltools/src/encoding.h 0000644 0001762 0000144 00000003364 13230557631 014676 0 ustar ligges users #include <Rcpp.h>
using namespace Rcpp;
#ifndef __ENCODING_INCLUDED__
#define __ENCODING_INCLUDED__
/**
* A namespace for applying percent-encoding to
* arbitrary strings - optimised for URLs, obviously.
*/
namespace encoding{
/**
* A function for taking a hexadecimal element and converting
* it to the equivalent non-hex value. Used in internal_url_decode
*
* @param x a character representing a single hexadecimal digit.
*
* @see to_hex for the reverse operation.
*
* @return the numeric value of x, or -1 if x is not a hexadecimal digit.
*/
char from_hex (char x);
/**
* A function for taking a character value and converting
* it to the equivalent hexadecimal value. Used in internal_url_encode.
*
* @param x a character to convert to hexadecimal.
*
* @see from_hex for the reverse operation.
*
* @return a string containing the now-hexed value of x.
*/
std::string to_hex(char x);
/**
* A function for decoding URLs. calls from_hex, and is
* in turn called by url_decode in urltools.cpp.
*
* @param url a string representing a percent-encoded URL.
*
* @see internal_url_encode for the reverse operation.
*
* @return a string containing the decoded URL.
*/
std::string internal_url_decode(std::string url);
/**
* A function for encoding URLs. calls to_hex, and is
* in turn called by url_encode in urltools.cpp.
*
* @param url a string representing a URL.
*
* @see internal_url_decode for the reverse operation.
*
* @return a string containing the percent-encoded version of "url".
*/
std::string internal_url_encode(std::string url);
}
#endif
urltools/src/utf8.h 0000644 0001762 0000144 00000000617 13230557631 013774 0 ustar ligges users #ifndef UTF8_H
#define UTF8_H
extern int locale_is_utf8;
/* is c the start of a utf8 sequence? */
#define isutf(c) (((c)&0xC0)!=0x80)
#define UEOF ((uint32_t)-1)
/* convert UTF-8 data to wide character */
size_t u8_toucs(uint32_t *dest, size_t sz, const char *src, size_t srcsz);
/* the opposite conversion */
size_t u8_toutf8(char *dest, size_t sz, const uint32_t *src, size_t srcsz);
#endif
urltools/src/compose.cpp 0000644 0001762 0000144 00000004575 13230557631 015115 0 ustar ligges users #include "compose.h"
bool compose::emptycheck(String element){
if(element == NA_STRING){
return false;
}
return true;
}
std::string compose::compose_single(String scheme, String domain, String port, String path,
String parameter, String fragment){
std::string output;
if(emptycheck(scheme)){
output += scheme;
output += "://";
}
if(emptycheck(domain)){
output += domain;
}
if(emptycheck(port)){
output += ":";
output += port;
}
output += "/";
if(emptycheck(path)){
output += path;
}
if(emptycheck(parameter)){
output += "?";
output += parameter;
}
if(emptycheck(fragment)){
output += "#";
output += fragment;
}
return output;
}
CharacterVector compose::compose_multiple(DataFrame parsed_urls){
CharacterVector schemes = parsed_urls["scheme"];
CharacterVector domains = parsed_urls["domain"];
CharacterVector ports = parsed_urls["port"];
CharacterVector paths = parsed_urls["path"];
CharacterVector parameters = parsed_urls["parameter"];
CharacterVector fragments = parsed_urls["fragment"];
unsigned int input_size = schemes.size();
CharacterVector output(input_size);
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
output[i] = compose_single(schemes[i], domains[i], ports[i], paths[i], parameters[i],
fragments[i]);
}
return output;
}
//'@title Recompose Parsed URLs
//'
//'@description Sometimes you want to take a vector of URLs, parse them, perform
//'some operations and then rebuild them. \code{url_compose} takes a data.frame produced
//'by \code{\link{url_parse}} and rebuilds it into a vector of full URLs (or: URLs as full
//'as the vector initially thrown into url_parse).
//'
//'This is currently a `beta` feature; please do report bugs if you find them.
//'
//'@param parsed_urls a data.frame sourced from \code{\link{url_parse}}
//'
//'@seealso \code{\link{scheme}} and other accessors, which you may want to
//'run URLs through before composing them to modify individual values.
//'
//'@examples
//'#Parse a URL and compose it
//'url <- "http://en.wikipedia.org"
//'url_compose(url_parse(url))
//'
//'@export
//[[Rcpp::export]]
CharacterVector url_compose(DataFrame parsed_urls){
return compose::compose_multiple(parsed_urls);
}
urltools/src/encoding.cpp 0000644 0001762 0000144 00000015152 13230557631 015227 0 ustar ligges users #include <Rcpp.h>
#include "encoding.h"
using namespace Rcpp;
char encoding::from_hex (char x){
if(x <= '9' && x >= '0'){
x -= '0';
} else if(x <= 'f' && x >= 'a'){
x -= ('a' - 10);
} else if(x <= 'F' && x >= 'A'){
x -= ('A' - 10);
} else {
x = -1;
}
return x;
}
std::string encoding::to_hex(char x){
//Holding objects and output
char digit_1 = (x&0xF0)>>4;
char digit_2 = (x&0x0F);
std::string output;
//Convert
if(0 <= digit_1 && digit_1 <= 9){
digit_1 += 48;
} else if(10 <= digit_1 && digit_1 <=15){
digit_1 += 97-10;
}
if(0 <= digit_2 && digit_2 <= 9){
digit_2 += 48;
} else if(10 <= digit_2 && digit_2 <= 15){
digit_2 += 97-10;
}
output.append(&digit_1, 1);
output.append(&digit_2, 1);
return output;
}
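// Editor's note - worked example: to_hex(' ') splits 0x20 into digits 2 and 0
// and returns "20"; from_hex('2') == 2 and from_hex('0') == 0, so
// (from_hex('2') << 4) | from_hex('0') recovers the original byte.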
std::string encoding::internal_url_decode(std::string url){
//Create output object
std::string result;
//For each character...
for (std::string::size_type i = 0; i < url.size(); ++i){
//If it's a +, space
if (url[i] == '+'){
result += ' ';
} else if (url[i] == '%' && url.size() > i+2){
//Escaped? Convert from hex and includes
char holding_1 = encoding::from_hex(url[i+1]);
char holding_2 = encoding::from_hex(url[i+2]);
if (holding_1 >= 0 && holding_2 >= 0) {
char holding = (holding_1 << 4) | holding_2;
result += holding;
i += 2;
} else {
result += url[i];
}
} else { //Permitted? Include.
result += url[i];
}
}
//Return
return result;
}
std::string encoding::internal_url_encode(std::string url){
//Note the unreserved characters, create an output string
std::string unreserved_characters = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ._~-";
std::string output = "";
//For each character..
for(int i=0; i < (signed) url.length(); i++){
//If it's in the list of unreserved ones, just pass it through
if (unreserved_characters.find_first_of(url[i]) != std::string::npos){
output.append(&url[i], 1);
//Otherwise, append in an encoded form.
} else {
output.append("%");
output.append(to_hex(url[i]));
}
}
//Return
return output;
}
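// Editor's note - worked example: internal_url_encode("a b") leaves the
// unreserved 'a' and 'b' untouched and emits "%" plus to_hex(' '), i.e.
// "a%20b"; internal_url_decode("a%20b") reverses this.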
//'@title Encode or decode a URI
//'@description encodes or decodes a URI/URL
//'
//'@param urls a vector of URLs to decode or encode.
//'
//'@details
//'URL encoding and decoding is an essential prerequisite to proper web interaction
//'and data analysis around things like server-side logs. The
//'\href{http://tools.ietf.org/html/rfc3986}{relevant IETF RfC} mandates the percentage-encoding
//'of characters outside the unreserved set, including things like slashes, except where they
//'appear in their reserved role.
//'
//'Base R provides \code{\link{URLdecode}} and \code{\link{URLencode}}, which handle
//'URL encoding - in theory. In practise, they have a set of substantial problems
//'that the urltools implementation solves:
//'
//'\itemize{
//' \item{No vectorisation: }{Both base R functions operate on single URLs, not vectors of URLs.
//' This means that, when confronted with a vector of URLs that need encoding or
//' decoding, your only option is to loop from within R. This can be incredibly
//' computationally costly with large datasets. url_encode and url_decode are
//' implemented in C++ and entirely vectorised, allowing for a substantial
//' performance improvement.}
//' \item{No scheme recognition: }{encoding the slashes in, say, http://, is a good way
//' of making sure your URL no longer works. Because of this, the only thing
//' you can encode in URLencode (unless you refuse to encode reserved characters)
//' is a partial URL, lacking the initial scheme, which requires additional operations
//' to set up and increases the complexity of encoding or decoding. url_encode
//' detects the protocol and silently splits it off, leaving it unencoded to ensure
//' that the resulting URL is valid.}
//' \item{ASCII NULs: }{Server side data can get very messy and sometimes include out-of-range
//' characters. Unfortunately, URLdecode's response to these characters is to convert
//' them to NULs, which R can't handle, at which point your URLdecode call breaks.
//' \code{url_decode} simply ignores them.}
//'}
//'
//'@return a character vector containing the encoded (or decoded) versions of "urls".
//'
//'@seealso \code{\link{puny_decode}} and \code{\link{puny_encode}}, for punycode decoding
//'and encoding.
//'
//'@examples
//'
//'url_decode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_%28logo%29.jpg")
//'url_encode("https://en.wikipedia.org/wiki/File:Vice_City_Public_Radio_(logo).jpg")
//'
//'\dontrun{
//'#A demonstrator of the contrasting behaviours around out-of-range characters
//'URLdecode("%gIL")
//'url_decode("%gIL")
//'}
//'@rdname encoder
//'@export
// [[Rcpp::export]]
CharacterVector url_decode(CharacterVector urls){
//Measure size, create output object
int input_size = urls.size();
CharacterVector output(input_size);
//Decode each string in turn.
for (int i = 0; i < input_size; ++i){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] == NA_STRING){
output[i] = NA_STRING;
} else {
output[i] = encoding::internal_url_decode(Rcpp::as<std::string>(urls[i]));
}
}
//Return
return output;
}
//'@rdname encoder
//'@export
// [[Rcpp::export]]
CharacterVector url_encode(CharacterVector urls){
//Measure size, create output object and holding objects
int input_size = urls.size();
CharacterVector output(input_size);
std::string holding;
size_t scheme_start;
size_t first_slash;
//For each string..
for (int i = 0; i < input_size; ++i){
//Check for user interrupts.
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] == NA_STRING){
output[i] = NA_STRING;
} else {
holding = Rcpp::as<std::string>(urls[i]);
//Extract the protocol. If you can't find it, just encode the entire thing.
scheme_start = holding.find("://");
if(scheme_start == std::string::npos){
output[i] = encoding::internal_url_encode(holding);
} else {
//Otherwise, split out the protocol and encode !protocol.
first_slash = holding.find("/", scheme_start+3);
if(first_slash == std::string::npos){
output[i] = holding.substr(0,scheme_start+3) + encoding::internal_url_encode(holding.substr(scheme_start+3));
} else {
output[i] = holding.substr(0,first_slash+1) + encoding::internal_url_encode(holding.substr(first_slash+1));
}
}
}
}
//Return
return output;
}
urltools/src/parameter.cpp 0000644 0001762 0000144 00000031403 13230557631 015416 0 ustar ligges users #include "parameter.h"
std::deque < std::string > parameter::get_query_string(std::string url){
std::deque < std::string > output;
size_t query_location = url.find("?");
if(query_location == std::string::npos){
output.push_back(url);
} else {
output.push_back(url.substr(0, query_location));
output.push_back(url.substr(query_location));
}
return output;
}
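// Editor's note - behaviour sketch: get_query_string("https://x.org/api?a=1")
// returns {"https://x.org/api", "?a=1"}; with no "?" present, the deque
// holds just the original URL.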
std::string parameter::set_parameter(std::string url, std::string& component, std::string value){
std::deque < std::string > holding = get_query_string(url);
if(holding.size() == 1){
return holding[0] + ("?" + component + "=" + value);
}
size_t component_location = std::string::npos, q_loc, amp_loc;
q_loc = holding[1].find(("?" + component + "="));
if(q_loc == std::string::npos){
amp_loc = holding[1].find(("&" + component + "="));
if(amp_loc != std::string::npos){
component_location = amp_loc + 1;
}
} else {
component_location = q_loc + 1;
}
if(component_location == std::string::npos){
holding[1] = (holding[1] + "&" + component + "=" + value);
} else {
size_t value_location = holding[1].find("&", component_location);
if(value_location == std::string::npos){
holding[1].replace(component_location, value_location, (component + "=" + value));
} else {
holding[1].replace(component_location, (value_location - component_location), (component + "=" + value));
}
}
return(holding[0] + holding[1]);
}
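// Editor's note - worked examples:
//   set_parameter("https://x.org/api?action=query", "action", "parse")
//     -> "https://x.org/api?action=parse" (existing key rewritten)
//   set_parameter("https://x.org/api", "action", "parse")
//     -> "https://x.org/api?action=parse" (query created from scratch)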
std::string parameter::remove_parameter_single(std::string url, CharacterVector params){
std::deque < std::string > parsed_url = get_query_string(url);
if(parsed_url.size() == 1){
return url;
}
for(unsigned int i = 0; i < params.size(); i++){
if(params[i] != NA_STRING){
std::string param = Rcpp::as<std::string>(params[i]);
size_t param_location = parsed_url[1].find(param);
while(param_location != std::string::npos){
size_t end_location = parsed_url[1].find("&", param_location);
if(end_location == std::string::npos){
// Nothing follows; erase through to the end of the query string.
parsed_url[1].erase(param_location);
} else {
// Erase the key=value pair plus the "&" separating it from the next pair.
parsed_url[1].erase(param_location, (end_location - param_location) + 1);
}
param_location = parsed_url[1].find(param, param_location);
}
}
}
// We may have removed all of the parameters or the last one, leading to trailing ampersands or
// question marks. If those exist, erase them.
if(parsed_url[1][parsed_url[1].size()-1] == '&' || parsed_url[1][parsed_url[1].size()-1] == '?'){
parsed_url[1].erase(parsed_url[1].size()-1);
}
return (parsed_url[0] + parsed_url[1]);
}
// Scan for the next "&" separator that is not part of a literal "&amp;" entity
size_t find_ampersand(std::string query, size_t pos = 0) {
while (true) {
size_t amp = query.find_first_of("", pos);
if (amp == std::string::npos) {
pos = amp;
break;
}
if (query[amp] == '#') {
pos = std::string::npos;
break;
}
if (query.compare(amp, 5, "&") == 0) {
pos = amp + 1;
continue;
}
pos = amp;
break;
}
return pos;
}
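// Editor's note - behaviour sketch: in "?a=1&amp;b=2&c=3", find_ampersand
// skips the literal "&amp;" entity at offset 4 and returns offset 12, the
// bare "&" before "c=3"; hitting a "#" stops the scan and yields npos.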
std::deque < std::string > parameter::get_parameter_names_single(std::string url){
std::deque < std::string > parsed_entry = get_query_string(url);
std::deque < std::string > out;
if(parsed_entry.size() < 2){
return out;
}
std::string query = parsed_entry[1];
size_t amp = 0;
size_t eq;
while(amp != std::string::npos) {
eq = query.find("=", amp);
size_t next_amp = find_ampersand(query, amp+1);
if (eq == std::string::npos) {
amp = next_amp;
continue;
}
if (next_amp != std::string::npos && eq > next_amp) {
amp = next_amp;
continue;
}
out.push_back(query.substr(amp+1, eq-amp-1));
amp = next_amp;
}
return out;
}
CharacterVector parameter::get_parameter_names(CharacterVector &urls) {
std::set < std::string > names;
for (int i = 0; i < urls.length(); i++) {
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if (urls[i] == R_NaString) {
continue;
}
std::string str = (std::string) urls[i];
std::deque < std::string > labels = get_parameter_names_single(str);
for (unsigned int j = 0; j < labels.size(); j++) {
names.insert(labels[j]);
}
}
CharacterVector out(names.size());
int ii = 0;
for (std::set< std::string >::iterator i = names.begin();
i != names.end();
ii++, i++) {
out[ii] = *i;
}
return out;
}
String parameter::get_parameter_single(std::string url, std::string& component){
// Extract actual query string
std::deque < std::string > parsed_entry = get_query_string(url);
if(parsed_entry.size() < 2){
return NA_STRING;
}
std::string holding = parsed_entry[1];
int component_size;
// ID where the location is
size_t first_find = holding.find(component);
if(first_find == std::string::npos){
return NA_STRING;
}
if(holding[first_find-1] != '&' && holding[first_find-1] != '?'){
first_find = holding.find("&" + component);
component_size = (component.size() + 1);
if(first_find == std::string::npos){
return NA_STRING;
}
} else {
component_size = component.size();
}
size_t next_location = find_ampersand(holding, first_find + 1);
if(next_location == std::string::npos) {
// check for fragment
next_location = holding.find("#", first_find + component_size);
}
if (next_location == std::string::npos) {
return holding.substr(first_find + component_size);
}
return holding.substr(first_find + component_size, (next_location-(first_find + component_size)));
}
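// Editor's note - worked example, with component already suffixed with "=" by
// the caller: for "https://x.org/f?this=x&hiphop=awesome#top" and component
// "hiphop=", the key is found after the "&", the value is clipped at the
// fragment marker, and "awesome" is returned.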
//Parameter retrieval
CharacterVector parameter::get_parameter(CharacterVector& urls, std::string component){
unsigned int input_size = urls.size();
CharacterVector output(input_size);
component = component + "=";
for(unsigned int i = 0; i < input_size; ++i){
if(urls[i] == NA_STRING){
output[i] = NA_STRING;
} else {
output[i] = get_parameter_single(Rcpp::as<std::string>(urls[i]), component);
}
}
return output;
}
CharacterVector parameter::set_parameter_vectorised(CharacterVector urls, String component,
CharacterVector value){
unsigned int input_size = urls.size();
CharacterVector output(input_size);
if(component != NA_STRING){
std::string component_ref = component.get_cstring();
if(value.size() == input_size){
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] != NA_STRING && value[i] != NA_STRING){
output[i] = set_parameter(Rcpp::as<std::string>(urls[i]), component_ref,
Rcpp::as<std::string>(value[i]));
} else if(value[i] == NA_STRING){
output[i] = urls[i];
} else {
output[i] = NA_STRING;
}
}
} else if(value.size() == 1){
if(value[0] != NA_STRING){
std::string value_ref = Rcpp::as<std::string>(value[0]);
for(unsigned int i = 0; i < input_size; i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] != NA_STRING){
output[i] = set_parameter(Rcpp::as<std::string>(urls[i]), component_ref, value_ref);
} else {
output[i] = NA_STRING;
}
}
} else {
return urls;
}
} else {
throw std::range_error("'value' must be the same length as 'urls', or of length 1");
}
} else {
return urls;
}
return output;
}
CharacterVector parameter::remove_parameter_vectorised(CharacterVector urls,
CharacterVector params){
unsigned int input_size = urls.size();
CharacterVector output(input_size);
CharacterVector p_copy = params;
// Generate easily find-able params.
for(unsigned int i = 0; i < p_copy.size(); i++){
if(p_copy[i] != NA_STRING){
p_copy[i] += "=";
}
}
// For each URL, remove those parameters.
for(unsigned int i = 0; i < urls.size(); i++){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
if(urls[i] != NA_STRING){
output[i] = remove_parameter_single(Rcpp::as<std::string>(urls[i]), p_copy);
} else {
output[i] = NA_STRING;
}
}
// Return
return output;
}
//'@title get the values of a URL's parameters
//'@description URLs can have parameters, taking the form of \code{name=value}, chained together
//'with \code{&} symbols. \code{param_get}, when provided with a vector of URLs and a vector
//'of parameter names, will generate a data.frame consisting of the values of each parameter
//'for each URL.
//'
//'@param urls a vector of URLs
//'
//'@param parameter_names a vector of parameter names. If \code{NULL} (default), will extract
//'all parameters that are present.
//'
//'@return a data.frame containing one column for each provided parameter name. Values that
//'cannot be found within a particular URL are represented by an NA.
//'
//'@examples
//'#A very simple example
//'url <- "https://google.com:80/foo.php?this_parameter=selfreferencing&hiphop=awesome"
//'parameter_values <- param_get(url, c("this_parameter","hiphop"))
//'
//'@seealso \code{\link{url_parse}} for decomposing URLs into their constituent parts and
//'\code{\link{param_set}} for inserting or modifying key/value pairs within a query string.
//'
//'@aliases param_get url_parameter
//'@rdname param_get
//'@export
//[[Rcpp::export]]
List param_get(CharacterVector urls, Nullable<CharacterVector> parameter_names = R_NilValue){
CharacterVector param_names;
if (parameter_names.isNull()) {
param_names = parameter::get_parameter_names(urls);
} else {
param_names = parameter_names.get();
}
List output;
IntegerVector rownames = Rcpp::seq(1,urls.size());
unsigned int column_count = param_names.size();
for(unsigned int i = 0; i < column_count; ++i){
if((i % 10000) == 0){
Rcpp::checkUserInterrupt();
}
output.push_back(parameter::get_parameter(urls, Rcpp::as<std::string>(param_names[i])));
}
output.attr("class") = "data.frame";
output.attr("names") = param_names;
output.attr("row.names") = rownames;
return output;
}
//'@title Set the value associated with a parameter in a URL's query.
//'@description URLs often have queries associated with them, particularly URLs for
//'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_set}
//'allows you to modify key/value pairs within query strings, or even add new ones
//'if they don't exist within the URL.
//'
//'@param urls a vector of URLs. These should be decoded (with \code{url_decode})
//'but do not have to have been otherwise manipulated.
//'
//'@param key a string representing the key to modify the value of (or insert wholesale
//'if it doesn't exist within the URL).
//'
//'@param value a value to associate with the key. This can be a single string,
//'or a vector the same length as \code{urls}
//'
//'@return the original vector of URLs, but with modified/inserted key-value pairs. If the
//'URL is \code{NA}, the returned value will be \code{NA} - if the key or value are, no insertion
//'will be made.
//'
//'@examples
//'# Set a URL parameter where there's already a key for that
//'param_set("https://en.wikipedia.org/api.php?action=query", "action", "pageinfo")
//'
//'# Set a URL parameter where there isn't.
//'param_set("https://en.wikipedia.org/api.php?list=props", "action", "pageinfo")
//'
//'@seealso \code{\link{param_get}} to retrieve the values associated with multiple keys in
//'a vector of URLs, and \code{\link{param_remove}} to strip key/value pairs from a URL entirely.
//'
//'@export
//[[Rcpp::export]]
CharacterVector param_set(CharacterVector urls, String key, CharacterVector value){
return parameter::set_parameter_vectorised(urls, key, value);
}
//'@title Remove key-value pairs from query strings
//'@description URLs often have queries associated with them, particularly URLs for
//'APIs, that look like \code{?key=value&key=value&key=value}. \code{param_remove}
//'allows you to remove key/value pairs while leaving the rest of the URL intact.
//'
//'@param urls a vector of URLs. These should be decoded with \code{url_decode} but don't
//'have to have been otherwise processed.
//'
//'@param keys a vector of parameter keys to remove.
//'
//'@return the original URLs but with the key/value pairs specified by \code{keys} removed.
//'If the original URL is \code{NA}, \code{NA} will be returned; if a specified key is \code{NA},
//'nothing will be done with it.
//'
//'@seealso \code{\link{param_set}} to modify values associated with keys, or \code{\link{param_get}}
//'to retrieve those values.
//'
//'@examples
//'# Remove multiple parameters from a URL
//'param_remove(urls = "https://en.wikipedia.org/wiki/api.php?action=list&type=query&format=json",
//' keys = c("action","format"))
//'@export
//[[Rcpp::export]]
CharacterVector param_remove(CharacterVector urls, CharacterVector keys){
return parameter::remove_parameter_vectorised(urls, keys);
}
urltools/src/RcppExports.cpp 0000644 0001762 0000144 00000026633 13230557631 015740 0 ustar ligges users // Generated by using Rcpp::compileAttributes() -> do not edit by hand
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
#include <Rcpp.h>
using namespace Rcpp;
// url_compose
CharacterVector url_compose(DataFrame parsed_urls);
RcppExport SEXP _urltools_url_compose(SEXP parsed_urlsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< DataFrame >::type parsed_urls(parsed_urlsSEXP);
rcpp_result_gen = Rcpp::wrap(url_compose(parsed_urls));
return rcpp_result_gen;
END_RCPP
}
// strip_credentials
CharacterVector strip_credentials(CharacterVector urls);
RcppExport SEXP _urltools_strip_credentials(SEXP urlsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
rcpp_result_gen = Rcpp::wrap(strip_credentials(urls));
return rcpp_result_gen;
END_RCPP
}
// get_credentials
DataFrame get_credentials(CharacterVector urls);
RcppExport SEXP _urltools_get_credentials(SEXP urlsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
rcpp_result_gen = Rcpp::wrap(get_credentials(urls));
return rcpp_result_gen;
END_RCPP
}
// url_decode
CharacterVector url_decode(CharacterVector urls);
RcppExport SEXP _urltools_url_decode(SEXP urlsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
rcpp_result_gen = Rcpp::wrap(url_decode(urls));
return rcpp_result_gen;
END_RCPP
}
// url_encode
CharacterVector url_encode(CharacterVector urls);
RcppExport SEXP _urltools_url_encode(SEXP urlsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
rcpp_result_gen = Rcpp::wrap(url_encode(urls));
return rcpp_result_gen;
END_RCPP
}
// param_get
List param_get(CharacterVector urls, Nullable<CharacterVector> parameter_names);
RcppExport SEXP _urltools_param_get(SEXP urlsSEXP, SEXP parameter_namesSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< Nullable<CharacterVector> >::type parameter_names(parameter_namesSEXP);
rcpp_result_gen = Rcpp::wrap(param_get(urls, parameter_names));
return rcpp_result_gen;
END_RCPP
}
// param_set
CharacterVector param_set(CharacterVector urls, String key, CharacterVector value);
RcppExport SEXP _urltools_param_set(SEXP urlsSEXP, SEXP keySEXP, SEXP valueSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< String >::type key(keySEXP);
Rcpp::traits::input_parameter< CharacterVector >::type value(valueSEXP);
rcpp_result_gen = Rcpp::wrap(param_set(urls, key, value));
return rcpp_result_gen;
END_RCPP
}
// param_remove
CharacterVector param_remove(CharacterVector urls, CharacterVector keys);
RcppExport SEXP _urltools_param_remove(SEXP urlsSEXP, SEXP keysSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< CharacterVector >::type keys(keysSEXP);
rcpp_result_gen = Rcpp::wrap(param_remove(urls, keys));
return rcpp_result_gen;
END_RCPP
}
// url_parse
DataFrame url_parse(CharacterVector urls);
RcppExport SEXP _urltools_url_parse(SEXP urlsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
rcpp_result_gen = Rcpp::wrap(url_parse(urls));
return rcpp_result_gen;
END_RCPP
}
// get_component_
CharacterVector get_component_(CharacterVector urls, int component);
RcppExport SEXP _urltools_get_component_(SEXP urlsSEXP, SEXP componentSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< int >::type component(componentSEXP);
rcpp_result_gen = Rcpp::wrap(get_component_(urls, component));
return rcpp_result_gen;
END_RCPP
}
// set_component_
CharacterVector set_component_(CharacterVector urls, int component, CharacterVector new_value);
RcppExport SEXP _urltools_set_component_(SEXP urlsSEXP, SEXP componentSEXP, SEXP new_valueSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< int >::type component(componentSEXP);
Rcpp::traits::input_parameter< CharacterVector >::type new_value(new_valueSEXP);
rcpp_result_gen = Rcpp::wrap(set_component_(urls, component, new_value));
return rcpp_result_gen;
END_RCPP
}
// set_component_r
CharacterVector set_component_r(CharacterVector urls, int component, CharacterVector new_value, std::string comparator);
RcppExport SEXP _urltools_set_component_r(SEXP urlsSEXP, SEXP componentSEXP, SEXP new_valueSEXP, SEXP comparatorSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< int >::type component(componentSEXP);
Rcpp::traits::input_parameter< CharacterVector >::type new_value(new_valueSEXP);
Rcpp::traits::input_parameter< std::string >::type comparator(comparatorSEXP);
rcpp_result_gen = Rcpp::wrap(set_component_r(urls, component, new_value, comparator));
return rcpp_result_gen;
END_RCPP
}
// set_component_f
CharacterVector set_component_f(CharacterVector urls, int component, CharacterVector new_value, std::string comparator);
RcppExport SEXP _urltools_set_component_f(SEXP urlsSEXP, SEXP componentSEXP, SEXP new_valueSEXP, SEXP comparatorSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< int >::type component(componentSEXP);
Rcpp::traits::input_parameter< CharacterVector >::type new_value(new_valueSEXP);
Rcpp::traits::input_parameter< std::string >::type comparator(comparatorSEXP);
rcpp_result_gen = Rcpp::wrap(set_component_f(urls, component, new_value, comparator));
return rcpp_result_gen;
END_RCPP
}
// rm_component_
CharacterVector rm_component_(CharacterVector urls, int component);
RcppExport SEXP _urltools_rm_component_(SEXP urlsSEXP, SEXP componentSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type urls(urlsSEXP);
Rcpp::traits::input_parameter< int >::type component(componentSEXP);
rcpp_result_gen = Rcpp::wrap(rm_component_(urls, component));
return rcpp_result_gen;
END_RCPP
}
// puny_encode
CharacterVector puny_encode(CharacterVector x);
RcppExport SEXP _urltools_puny_encode(SEXP xSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type x(xSEXP);
rcpp_result_gen = Rcpp::wrap(puny_encode(x));
return rcpp_result_gen;
END_RCPP
}
// puny_decode
CharacterVector puny_decode(CharacterVector x);
RcppExport SEXP _urltools_puny_decode(SEXP xSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type x(xSEXP);
rcpp_result_gen = Rcpp::wrap(puny_decode(x));
return rcpp_result_gen;
END_RCPP
}
// reverse_strings
CharacterVector reverse_strings(CharacterVector strings);
RcppExport SEXP _urltools_reverse_strings(SEXP stringsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type strings(stringsSEXP);
rcpp_result_gen = Rcpp::wrap(reverse_strings(strings));
return rcpp_result_gen;
END_RCPP
}
// finalise_suffixes
DataFrame finalise_suffixes(CharacterVector full_domains, CharacterVector suffixes, LogicalVector wildcard, LogicalVector is_suffix);
RcppExport SEXP _urltools_finalise_suffixes(SEXP full_domainsSEXP, SEXP suffixesSEXP, SEXP wildcardSEXP, SEXP is_suffixSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type full_domains(full_domainsSEXP);
Rcpp::traits::input_parameter< CharacterVector >::type suffixes(suffixesSEXP);
Rcpp::traits::input_parameter< LogicalVector >::type wildcard(wildcardSEXP);
Rcpp::traits::input_parameter< LogicalVector >::type is_suffix(is_suffixSEXP);
rcpp_result_gen = Rcpp::wrap(finalise_suffixes(full_domains, suffixes, wildcard, is_suffix));
return rcpp_result_gen;
END_RCPP
}
// tld_extract_
CharacterVector tld_extract_(CharacterVector domains);
RcppExport SEXP _urltools_tld_extract_(SEXP domainsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type domains(domainsSEXP);
rcpp_result_gen = Rcpp::wrap(tld_extract_(domains));
return rcpp_result_gen;
END_RCPP
}
// host_extract_
CharacterVector host_extract_(CharacterVector domains);
RcppExport SEXP _urltools_host_extract_(SEXP domainsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< CharacterVector >::type domains(domainsSEXP);
rcpp_result_gen = Rcpp::wrap(host_extract_(domains));
return rcpp_result_gen;
END_RCPP
}
static const R_CallMethodDef CallEntries[] = {
{"_urltools_url_compose", (DL_FUNC) &_urltools_url_compose, 1},
{"_urltools_strip_credentials", (DL_FUNC) &_urltools_strip_credentials, 1},
{"_urltools_get_credentials", (DL_FUNC) &_urltools_get_credentials, 1},
{"_urltools_url_decode", (DL_FUNC) &_urltools_url_decode, 1},
{"_urltools_url_encode", (DL_FUNC) &_urltools_url_encode, 1},
{"_urltools_param_get", (DL_FUNC) &_urltools_param_get, 2},
{"_urltools_param_set", (DL_FUNC) &_urltools_param_set, 3},
{"_urltools_param_remove", (DL_FUNC) &_urltools_param_remove, 2},
{"_urltools_url_parse", (DL_FUNC) &_urltools_url_parse, 1},
{"_urltools_get_component_", (DL_FUNC) &_urltools_get_component_, 2},
{"_urltools_set_component_", (DL_FUNC) &_urltools_set_component_, 3},
{"_urltools_set_component_r", (DL_FUNC) &_urltools_set_component_r, 4},
{"_urltools_set_component_f", (DL_FUNC) &_urltools_set_component_f, 4},
{"_urltools_rm_component_", (DL_FUNC) &_urltools_rm_component_, 2},
{"_urltools_puny_encode", (DL_FUNC) &_urltools_puny_encode, 1},
{"_urltools_puny_decode", (DL_FUNC) &_urltools_puny_decode, 1},
{"_urltools_reverse_strings", (DL_FUNC) &_urltools_reverse_strings, 1},
{"_urltools_finalise_suffixes", (DL_FUNC) &_urltools_finalise_suffixes, 4},
{"_urltools_tld_extract_", (DL_FUNC) &_urltools_tld_extract_, 1},
{"_urltools_host_extract_", (DL_FUNC) &_urltools_host_extract_, 1},
{NULL, NULL, 0}
};
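// R_init_urltools runs when the shared library is loaded: registering
// CallEntries lets R resolve each .Call() routine from this table, and
// R_useDynamicSymbols(dll, FALSE) then disables slower dynamic symbol lookup.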
RcppExport void R_init_urltools(DllInfo *dll) {
R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
R_useDynamicSymbols(dll, FALSE);
}
urltools/NAMESPACE 0000644 0001762 0000144 00000001334 13230556700 013356 0 ustar ligges users
# Generated by roxygen2: do not edit by hand
export("domain<-")
export("fragment<-")
export("parameters<-")
export("path<-")
export("port<-")
export("scheme<-")
export(domain)
export(fragment)
export(get_credentials)
export(host_extract)
export(param_get)
export(param_remove)
export(param_set)
export(parameters)
export(path)
export(port)
export(puny_decode)
export(puny_encode)
export(scheme)
export(strip_credentials)
export(suffix_extract)
export(suffix_refresh)
export(tld_extract)
export(tld_refresh)
export(url_compose)
export(url_decode)
export(url_encode)
export(url_parse)
import(methods)
importFrom(Rcpp,sourceCpp)
importFrom(triebeard,longest_match)
importFrom(triebeard,trie)
useDynLib(urltools, .registration = TRUE)
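# useDynLib with .registration = TRUE resolves .Call() routines through the
# CallEntries table registered in src/RcppExports.cpp, rather than by symbol name.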
urltools/NEWS 0000644 0001762 0000144 00000021142 13230556775 012641 0 ustar ligges users
Version 1.7.0
-------------------------------------------------------------------------
FEATURES
* get_credentials() and strip_credentials() retrieve and remove user authentication info
from URLs, as appropriate (#64)
* A small performance improvement to parameter-related functions.
* The long-deprecated url_parameters has been removed.
* param_get() can now retrieve all available parameters: param_get(urls, parameter_names = NULL) scans for every parameter present (#82)
* Cases like param_get("http://foo.bar/?field=value&more", "field") retrieve the whole field (#82)
* Cases like param_get("http://foo.bar/?field=value#fragment", "field") don't retrieve the fragment (#82)
* parameters and other URL components can now be removed by assigning NULL to the component. See the vignettes, and the sketch below, for examples (#79)
* Setting components with e.g. path() is now fully vectorised (#76)
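For instance, a minimal sketch of NULL-assignment removal (the printed result is
assumed from the feature description, not taken from the package's tests):

    library(urltools)
    url <- "https://en.wikipedia.org/wiki/Article#History"
    fragment(url) <- NULL   # assigning NULL removes the component
    url   # expected: "https://en.wikipedia.org/wiki/Article"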
BUG FIXES
* url_parse now handles URLs with user auth/ident information in them, stripping it out (#64)
* url_parse now handles URLs that are IPv6 addresses (#65)
* url_parse now handles URLs with @ characters in query fragments
* url_parse now handles URLs with fragments, but no parameter or path (#83)
* url_compose now handles URLs with query strings but no paths (#71)
* param_set() can now handle parameters with overlapping names
* url_parse now handles URLs with >6 character schemes (#81)
* param_get() skips the URL fragment and handles values containing "&".
* It is no longer possible to duplicate separators with function calls like path(url) <- '/path'. (#78)
* A small bugfix in URL decoding to avoid a situation where URLs that (wrongly) contained bare percentage signs had the characters behind those signs decoded (#87)
DEVELOPMENT
* Internal API simplified.
Version 1.6.0
-------------------------------------------------------------------------
FEATURES
* Full punycode encoding and decoding support, thanks to Drew Schmidt (see the sketch below).
* param_get, param_set and param_remove are all fully capable of handling NA values.
* component setting functions can now assign even when the previous value was NA.
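For instance, a minimal sketch (the xn-- form shown is the standard punycode
rendering of this domain, assumed rather than taken from the package's tests):

    library(urltools)
    puny_encode("https://www.bücher.com/kinder")
    # expected: "https://www.xn--bcher-kva.com/kinder"
    puny_decode("https://www.xn--bcher-kva.com/kinder")
    # expected: "https://www.bücher.com/kinder"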
Version 1.5.2
-------------------------------------------------------------------------
BUGS
* Custom suffix lists were not working properly.
Version 1.5.1
-------------------------------------------------------------------------
BUGS
* Fixed a bug in which punycode TLDs were excluded from TLD extraction (thanks to
Alex Pinto for pointing that out) #51
* param_get now returns NAs for missing values, rather than empty strings (thanks to Josh Izzard for the report) #49
* suffix_extract no longer goofs if the domain+suffix combo overlaps with a valid suffix (thanks to Maryam Khezrzadeh and Alex Pinto) #50
DEVELOPMENT
* Removed the non-portable -g compiler flag in response to CRAN feedback.
Version 1.5.0
-------------------------------------------------------------------------
FEATURES
* Using tries as a data structure (see https://github.com/Ironholds/triebeard), we've increased the speed of suffix_extract(): instead of taking twenty seconds to process a million domains, it now takes one.
* A dataset of top-level domains (TLDs) is now available as data(tld_dataset)
* suffix_refresh() has been reinstated, and can be used with suffix_extract() to ensure suffix
extraction is done with the most up-to-date dataset version possible.
* tld_extract() and tld_refresh() mirrors the functionality of suffix_extract() and suffix_refresh()
* host_extract() lets you get the host (the lowest-level subdomain, or the domain itself if no subdomain
is present) from the `domain` fragment of a parsed URL.
BUG FIXES
* Code from Jay Jacobs has allowed us to include a best-guess at the org name in the suffix dataset.
* url_parameters is now deprecated, and has been marked as such.
DEVELOPMENT
* Suffix and TLD datasets are now instantiated and processed on load, which marginally increases
the speed of the related functions (if you're calling suffix/TLD-related functions more than once a session)
Version 1.4.0
-------------------------------------------------------------------------
BUG FIXES
* Full NA support is now available!
DEVELOPMENT
* A substantial (20%) speed increase is now available thanks to internal
refactoring.
Version 1.3.3
-------------------------------------------------------------------------
BUG FIXES
* url_parse no longer lower-cases URLs (case sensitivity is Important) thanks to GitHub user 17843
DOCUMENTATION
* A note on NAs (as reported by Alex Pinto) added to the vignette
* Mention Bob Rudis's 'punycode' package.
Version 1.3.2
-------------------------------------------------------------------------
BUG FIXES
* Fixed a critical bug impacting URLs with colons in the path
Version 1.3.1
-------------------------------------------------------------------------
CHANGES
* suffix_refresh has been removed, since LazyData's parameters prevented it from functioning; thanks to
Alex Pinto for the initial bug report and Hadley Wickham for confirming the possible solutions.
BUG FIXES
* the parser was not properly handling ports; thanks to a report from Rich FitzJohn, this is now fixed.
Version 1.3.0
-------------------------------------------------------------------------
NEW FEATURES
* param_set() for inserting or modifying key/value pairs in URL query strings.
* param_remove() added for stripping key/value pairs out of URL query strings.
CHANGES
* url_parameters has been renamed param_get() under the new naming scheme - url_parameters still exists, however,
for the purpose of backwards-compatibility.
BUG FIXES
* Fixed a bug reported by Alex Pinto whereby URLs with parameters but no paths would not have their domain
correctly parsed.
Version 1.2.1
-------------------------------------------------------------------------
CHANGES
* Changed "tld" column to "suffix" in return of "suffix_extract" to more
accurately reflect what it is
* Switched to "vapply" in "suffix_extract" to give a bit of a speedup to
an already fast function
BUG FIXES
* Fixed documentation of "suffix_extract"
DEVELOPMENT
* More internal documentation added to compiled code.
* The suffix_dataset dataset was refreshed
Version 1.2.0
-------------------------------------------------------------------------
NEW FEATURES
* Jay Jacobs' "tldextract" functionality has been merged with urltools, and can be accessed
with "suffix_extract"
* At Nicolas Coutin's suggestion, url_compose - url_parse in reverse - has been introduced.
BUG FIXES
* To adhere to RfC standards, "query" functions have been renamed "parameter"
* A bug in which fragments could not be retrieved (and were incorrectly identified as parameters)
has been fixed. Thanks to Nicolas Coutin for reporting it and providing a reproducible example.
Version 1.1.1
-------------------------------------------------------------------------
BUG FIXES
* Parameter parsing now requires a = after the parameter name, fixing scenarios where the URL contained
the parameter name elsewhere (in, say, the domain) and the wrong value was grabbed. Thanks
to Jacob Barnett for the bug report and example.
* URL encoding no longer encodes the slash between the domain and path (thanks to Peter Meissner for pointing
this bug out).
DEVELOPMENT
* More unit tests
Version 1.1.0
-------------------------------------------------------------------------
NEW FEATURES
* url_parameters provides the values of specified parameters within a vector of URLs, as a data.frame
* KeyboardInterrupts are now available for interrupting long computations.
* url_parse now provides a data.frame, rather than a list, as output.
DEVELOPMENT
* De-static the hell out of all the C++.
* Internal refactor to store each logical stage of URL decomposition as its own method
* Internal refactor to use references, minimising memory usage; thanks to Mark Greenaway for making this work!
* Roxygen upgrade
Version 1.0.0
-------------------------------------------------------------------------
NEW FEATURES
* New get/set functionality, mimicking lubridate; see the package vignette.
DEVELOPMENT
* Internal C++ documentation added and the encoders and parsers refactored.
Version 0.6.0
-------------------------------------------------------------------------
NEW FEATURES
* replace_parameter introduced, to augment extract_parameter (previously simply url_param). This
allows you to take the value a parameter has associated with it, and replace it with one of your choosing.
* extract_host allows you to grab the hostname of a site, ignoring other components.
BUG FIXES
* extract_parameter (now url_extract_param) previously failed with an unhelpful error if the requested
parameter terminated the URL. This has now been fixed.
DEVELOPMENT
* Unit tests expanded
* Internal tweaks to improve the speed of url_decode and url_encode.
urltools/data/ 0000755 0001762 0000144 00000000000 13230556700 013047 5 ustar ligges users
urltools/data/suffix_dataset.rda 0000644 0001762 0000144 00000110160 13230556700 016547 0 ustar ligges users