rematch2/0000755000176200001440000000000013652743232011766 5ustar liggesusersrematch2/NAMESPACE0000644000176200001440000000052713652522166013212 0ustar liggesusers# Generated by roxygen2: do not edit by hand S3method("$",rematch_allrecords) S3method("$",rematch_records) export("$.rematch_allrecords") export("$.rematch_records") export(bind_re_match) export(bind_re_match_) export(re_exec) export(re_exec_all) export(re_match) export(re_match_all) importFrom(tibble,new_tibble) importFrom(tibble,tibble) rematch2/LICENSE0000644000176200001440000000010213225166704012763 0ustar liggesusersYEAR: 2016-2017 COPYRIGHT HOLDER: Mango Solutions, Gábor Csárdi rematch2/README.md0000644000176200001440000001607713637633662013270 0ustar liggesusers # rematch2 > Match Regular Expressions with a Nicer 'API' [![Linux Build Status](https://travis-ci.org/r-lib/rematch2.svg?branch=master)](https://travis-ci.org/r-lib/rematch2) [![Windows Build status](https://ci.appveyor.com/api/projects/status/github/r-lib/rematch2?svg=true)](https://ci.appveyor.com/project/gaborcsardi/rematch2) [![](http://www.r-pkg.org/badges/version/rematch2)](http://www.r-pkg.org/pkg/rematch2) [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/rematch2)](http://www.r-pkg.org/pkg/rematch2) [![Coverage Status](https://img.shields.io/codecov/c/github/r-lib/rematch2/master.svg)](https://codecov.io/github/r-lib/rematch2?branch=master) A small wrapper on regular expression matching functions `regexpr` and `gregexpr` to return the results in tidy data frames. --- - [Installation](#installation) - [Rematch vs rematch2](#rematch-vs-rematch2) - [Usage](#usage) - [First match](#first-match) - [All matches](#all-matches) - [Match positions](#match-positions) - [License](#license) ## Installation ```r install.packages("rematch2") ``` ## Rematch vs rematch2 Note that `rematch2` is not compatible with the original `rematch` package. 
There are at least three major changes:

* The order of the arguments for the functions is different. In
  `rematch2` the `text` vector is first, and `pattern` is second.
* In the result, `.match` is the last column instead of the first.
* `rematch2` returns `tibble` data frames. See
  https://github.com/hadley/tibble.

## Usage

### First match

```r
library(rematch2)
```

With capture groups:

```r
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016", "76-03-02",
           "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
```

```
#> # A tibble: 7 x 5
#>   ``    ``    ``    .text            .match
#>   <chr> <chr> <chr> <chr>            <chr>
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>
#> 4 <NA>  <NA>  <NA>  2016             <NA>
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21
```

Named capture groups:

```r
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
```

```
#> # A tibble: 7 x 5
#>   year  month day   .text            .match
#>   <chr> <chr> <chr> <chr>            <chr>
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>
#> 4 <NA>  <NA>  <NA>  2016             <NA>
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21
```

A slightly more complex example:

```r
github_repos <- c(
  "metacran/crandb",
  "jeroenooms/curl@v0.9.3",
  "jimhester/covr#47",
  "hadley/dplyr@*release",
  "r-lib/remotes@550a3c7d3f9e1493a2ba",
  "/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"

subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
```

```
#> # A tibble: 6 x 9
#>   owner      repo    subdir ref                  pull  release  catchall
#>   <chr>      <chr>   <chr>  <chr>                <chr> <chr>    <chr>
#> 1 metacran   crandb
#> 2 jeroenooms curl           v0.9.3
#> 3 jimhester  covr                                47
#> 4 hadley     dplyr                                     *release
#> 5 r-lib      remotes        550a3c7d3f9e1493a2ba
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>
```

### All matches

Extract all names, and also first names and last names:

```r
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
```

```
#> # A tibble: 2 x 4
#> first last .text .match
#>
#> 1 Ben Franklin and Jefferson Davis
#> 2 "\tMillard Fillmore"
```

```r
not$first
```

```
#> [[1]]
#> [1] "Ben"       "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
```

```r
not$last
```

```
#> [[1]]
#> [1] "Franklin" "Davis"
#>
#> [[2]]
#> [1] "Fillmore"
```

```r
not$.match
```

```
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#>
#> [[2]]
#> [1] "Millard Fillmore"
```

### Match positions

`re_exec` and `re_exec_all` are similar to `re_match` and
`re_match_all`, but they also return match positions. These functions
return match records. A match record has three components: `match`,
`start`, `end`, and each component can be a vector. It is similar to a
data frame in this respect.

```r
pos <- re_exec(notables, name_rex)
pos
```

```
#> # A tibble: 2 x 4
#> first last .text .match
#> *
#> 1 Ben Franklin and Jefferson Davis
#> 2 "\tMillard Fillmore"
```

Unfortunately R does not allow hierarchical data frames (i.e.
a column of a data frame cannot be another data frame), but `rematch2` defines some special classes and an `$` operator, to make it easier to extract parts of `re_exec` and `re_exec_all` matches. You simply query the `match`, `start` or `end` part of a column: ```r pos$first$match ``` ``` #> [1] "Ben" "Millard" ``` ```r pos$first$start ``` ``` #> [1] 3 2 ``` ```r pos$first$end ``` ``` #> [1] 5 8 ``` `re_exec_all` is very similar, but these queries return lists, with arbitrary number of matches: ```r allpos <- re_exec_all(notables, name_rex) allpos ``` ``` #> # A tibble: 2 x 4 #> first last .text .match #> #> 1 Ben Franklin and Jefferson Davis #> 2 "\tMillard Fillmore" ``` ```r allpos$first$match ``` ``` #> [[1]] #> [1] "Ben" "Jefferson" #> #> [[2]] #> [1] "Millard" ``` ```r allpos$first$start ``` ``` #> [[1]] #> [1] 3 20 #> #> [[2]] #> [1] 2 ``` ```r allpos$first$end ``` ``` #> [[1]] #> [1] 5 28 #> #> [[2]] #> [1] 8 ``` ## License MIT © Mango Solutions, Gábor Csárdi rematch2/man/0000755000176200001440000000000013652522166012542 5ustar liggesusersrematch2/man/re_match.Rd0000644000176200001440000000370713652522166014622 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/package.R \name{re_match} \alias{re_match} \title{Extract Regular Expression Matches Into a Data Frame} \usage{ re_match(text, pattern, perl = TRUE, ...) } \arguments{ \item{text}{Character vector.} \item{pattern}{A regular expression. See \code{\link[base]{regex}} for more about regular expressions.} \item{perl}{logical should perl compatible regular expressions be used? Defaults to TRUE, setting to FALSE will disable capture groups.} \item{...}{Additional arguments to pass to \code{\link[base]{regexpr}}.} } \value{ A data frame of character vectors: one column per capture group, named if the group was named, and additional columns for the input text and the first matching (sub)string. Each row corresponds to an element in the \code{text} vector. 
}
\description{
\code{re_match} wraps \code{\link[base]{regexpr}} and returns the
match results in a convenient data frame. The data frame has one
column for each capture group if \code{perl=TRUE}, and one final column
called \code{.match} for the matching (sub)string. The columns of the
capture groups are named if the groups themselves are named.
}
\note{
\code{re_match} uses PCRE compatible regular expressions by default
(i.e. \code{perl = TRUE} in \code{\link[base]{regexpr}}). You can switch
this off but if you do so capture groups will no longer be reported as they
are only supported by PCRE.
}
\examples{
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016", "76-03-02",
           "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)

# The same with named groups
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
}
\seealso{
Other tidy regular expression matching:
\code{\link{re_exec_all}()},
\code{\link{re_exec}()},
\code{\link{re_match_all}()}
}
\concept{tidy regular expression matching}
rematch2/man/bind_re_match.Rd0000644000176200001440000000323413637633662015620 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/bind_re_match.R
\name{bind_re_match}
\alias{bind_re_match}
\alias{bind_re_match_}
\title{Match results from a data frame column and attach results}
\usage{
bind_re_match(df, from, ..., keep_match = FALSE)

bind_re_match_(df, from, ..., keep_match = FALSE)
}
\arguments{
\item{df}{A data frame.}

\item{from}{Name of column to use as input for \code{\link{re_match}}.
\code{\link{bind_re_match}} takes unquoted names, while
\code{\link{bind_re_match_}} takes quoted names.}

\item{...}{Arguments (including \code{pattern}) to pass to
\code{\link{re_match}}.}

\item{keep_match}{Should the column \code{.match} be included in the results?
Defaults to \code{FALSE}, to avoid column name collisions in the case that
\code{\link{bind_re_match}} is called multiple times in succession.}
}
\description{
Taking a data frame and a column name as input, this function will run
\code{\link{re_match}} and bind the results as new columns to the original
table, returning a \code{\link[tibble]{tibble}}. This makes it friendly for
pipe-oriented programming with \link[magrittr]{magrittr}.
}
\section{Functions}{
\itemize{
\item \code{bind_re_match_}: Standard-evaluation version that takes a
quoted column name.
}}

\note{
If named capture groups will result in multiple columns with the same
column name, \code{\link[tibble]{repair_names}} will be called on the
resulting table.
}
\examples{
match_cars <- tibble::rownames_to_column(mtcars)
bind_re_match(match_cars, rowname, "^(?<make>\\\\w+) ?(?<model>.+)?$")
}
\seealso{
Standard-evaluation version \code{\link{bind_re_match_}} that is
suitable for programming.
}
rematch2/man/re_exec.Rd0000644000176200001440000000606613652522166014453 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/exec.R, R/indexing.R
\name{re_exec}
\alias{re_exec}
\alias{$.rematch_records}
\alias{$.rematch_allrecords}
\title{Extract Data From First Regular Expression Match Into a Data Frame}
\usage{
re_exec(text, pattern, perl = TRUE, ...)

\method{$}{rematch_records}(x, name)

\method{$}{rematch_allrecords}(x, name)
}
\arguments{
\item{text}{Character vector.}

\item{pattern}{A regular expression. See \code{\link[base]{regex}} for more
about regular expressions.}

\item{perl}{logical should perl compatible regular expressions be used?
Defaults to TRUE, setting to FALSE will disable capture groups.} \item{...}{Additional arguments to pass to \code{\link[base]{gregexpr}} (or \code{\link[base]{regexpr}} if \code{text} is of length zero).} \item{x}{Object returned by \code{re_exec} or \code{re_exec_all}.} \item{name}{\code{match}, \code{start} or \code{end}.} } \value{ A tidy data frame (see Section \dQuote{Tidy Data}). Match record entries are one length vectors that are set to NA if there is no match. } \description{ Match a regular expression to a string, and return matches, match positions, and capture groups. This function is like its \code{\link[=re_match]{match}} counterpart, except it returns match/capture group start and end positions in addition to the matched values. } \section{Tidy Data}{ The return value is a tidy data frame where each row corresponds to an element of the input character vector \code{text}. The values from \code{text} appear for reference in the \code{.text} character column. All other columns are list columns containing the match data. The \code{.match} column contains the match information for full regular expression matches while other columns correspond to capture groups if there are any, and PCRE matches are enabled with \code{perl = TRUE} (this is on by default). If capture groups are named the corresponding columns will bear those names. Each match data column list contains match records, one for each element in \code{text}. A match record is a named list, with entries \code{match}, \code{start} and \code{end} that are respectively the matching (sub) string, the start, and the end positions (using one based indexing). } \section{Extracting Match Data}{ To make it easier to extract matching substrings or positions, a special \code{$} operator is defined on match columns, both for the \code{.match} column and the columns corresponding to the capture groups. See examples below. 
}
\examples{
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
# Match first occurrence
pos <- re_exec(notables, name_rex)
pos

# Custom $ to extract matches and positions
pos$first$match
pos$first$start
pos$first$end
}
\seealso{
\code{\link[base]{regexpr}}, which this function wraps

Other tidy regular expression matching:
\code{\link{re_exec_all}()},
\code{\link{re_match_all}()},
\code{\link{re_match}()}
}
\concept{tidy regular expression matching}
rematch2/man/re_exec_all.Rd0000644000176200001440000000560213652522166015276 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/exec-all.R
\name{re_exec_all}
\alias{re_exec_all}
\title{Extract Data From All Regular Expression Matches Into a Data Frame}
\usage{
re_exec_all(text, pattern, perl = TRUE, ...)
}
\arguments{
\item{text}{Character vector.}

\item{pattern}{A regular expression. See \code{\link[base]{regex}} for more
about regular expressions.}

\item{perl}{logical should perl compatible regular expressions be used?
Defaults to TRUE, setting to FALSE will disable capture groups.}

\item{...}{Additional arguments to pass to
\code{\link[base]{gregexpr}} (or \code{\link[base]{regexpr}} if
\code{text} is of length zero).}
}
\value{
A tidy data frame (see Section \dQuote{Tidy Data}). The entries
within the match records within the list columns are vectors with one
element for each match in the corresponding text element.
}
\description{
Match a regular expression to a string, and return matches, match positions,
and capture groups. This function is like its
\code{\link[=re_match_all]{match}} counterpart, except it returns
match/capture group start and end positions in addition to the matched
values.
}
\section{Tidy Data}{

The return value is a tidy data frame where each row
corresponds to an element of the input character vector \code{text}.
The values from \code{text} appear for reference in the \code{.text} character column. All other columns are list columns containing the match data. The \code{.match} column contains the match information for full regular expression matches while other columns correspond to capture groups if there are any, and PCRE matches are enabled with \code{perl = TRUE} (this is on by default). If capture groups are named the corresponding columns will bear those names. Each match data column list contains match records, one for each element in \code{text}. A match record is a named list, with entries \code{match}, \code{start} and \code{end} that are respectively the matching (sub) string, the start, and the end positions (using one based indexing). } \section{Extracting Match Data}{ To make it easier to extract matching substrings or positions, a special \code{$} operator is defined on match columns, both for the \code{.match} column and the columns corresponding to the capture groups. See examples below. 
}
\examples{
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
# All occurrences
allpos <- re_exec_all(notables, name_rex)
allpos

# Custom $ to extract matches and positions
allpos$first$match
allpos$first$start
allpos$first$end
}
\seealso{
\code{\link[base]{gregexpr}}, which this function wraps

Other tidy regular expression matching:
\code{\link{re_exec}()},
\code{\link{re_match_all}()},
\code{\link{re_match}()}
}
\concept{tidy regular expression matching}
rematch2/man/rematch2-package.Rd0000644000176200001440000000136113652522166016130 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/package.R
\docType{package}
\name{rematch2-package}
\alias{rematch2}
\alias{rematch2-package}
\title{Match Regular Expressions with a Nicer 'API'}
\description{
A small wrapper on 'regexpr' to extract the matches and captured groups
from the match of a regular expression to a character vector. See
\code{\link{re_match}}.
}
\seealso{
Useful links:
\itemize{
  \item \url{https://github.com/r-lib/rematch2#readme}
  \item Report bugs at \url{https://github.com/r-lib/rematch2/issues}
}

}
\author{
\strong{Maintainer}: Gábor Csárdi \email{csardi.gabor@gmail.com}

Other contributors:
\itemize{
  \item Matthew Lincoln \email{matthew.d.lincoln@gmail.com} [contributor]
}

}
rematch2/man/re_match_all.Rd0000644000176200001440000000507113652522166015446 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/all.R
\name{re_match_all}
\alias{re_match_all}
\title{Extract All Regular Expression Matches Into a Data Frame}
\usage{
re_match_all(text, pattern, perl = TRUE, ...)
}
\arguments{
\item{text}{Character vector.}

\item{pattern}{A regular expression. See \code{\link[base]{regex}} for more
about regular expressions.}

\item{perl}{logical should perl compatible regular expressions be used?
Defaults to TRUE, setting to FALSE will disable capture groups.} \item{...}{Additional arguments to pass to \code{\link[base]{gregexpr}} (or \code{\link[base]{regexpr}} if \code{text} is of length zero).} } \value{ A tidy data frame (see Section \dQuote{Tidy Data}). The list columns contain character vectors with as many entries as there are matches for each input element. } \description{ This function is a thin wrapper on the \code{\link[base]{gregexpr}} base R function, to extract the matching (sub)strings as a data frame. It extracts all matches, and potentially their capture groups as well. } \note{ If the input text character vector has length zero, \code{\link[base]{regexpr}} is called instead of \code{\link[base]{gregexpr}}, because the latter cannot extract the number and names of the capture groups in this case. } \section{Tidy Data}{ The return value is a tidy data frame where each row corresponds to an element of the input character vector \code{text}. The values from \code{text} appear for reference in the \code{.text} character column. All other columns are list columns containing the match data. The \code{.match} column contains the match information for full regular expression matches while other columns correspond to capture groups if there are any, and PCRE matches are enabled with \code{perl = TRUE} (this is on by default). If capture groups are named the corresponding columns will bear those names. Each match data column list contains match records, one for each element in \code{text}. A match record is a named list, with entries \code{match}, \code{start} and \code{end} that are respectively the matching (sub) string, the start, and the end positions (using one based indexing). 
}
\examples{
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
re_match_all(notables, name_rex)
}
\seealso{
Other tidy regular expression matching:
\code{\link{re_exec_all}()},
\code{\link{re_exec}()},
\code{\link{re_match}()}
}
\concept{tidy regular expression matching}
rematch2/DESCRIPTION0000644000176200001440000000150713652743232013477 0ustar liggesusers
Package: rematch2
Title: Tidy Output from Regular Expression Matching
Version: 2.1.2
Authors@R: c(
    person("Gábor", "Csárdi", email = "csardi.gabor@gmail.com", role = c("aut", "cre")),
    person("Matthew", "Lincoln", email = "matthew.d.lincoln@gmail.com", role = c("ctb")))
Description: Wrappers on 'regexpr' and 'gregexpr' to return the match
    results in tidy data frames.
License: MIT + file LICENSE
LazyData: true
URL: https://github.com/r-lib/rematch2#readme
BugReports: https://github.com/r-lib/rematch2/issues
RoxygenNote: 7.1.0
Imports: tibble
Suggests: covr, testthat
Encoding: UTF-8
NeedsCompilation: no
Packaged: 2020-04-30 10:31:13 UTC; gaborcsardi
Author: Gábor Csárdi [aut, cre], Matthew Lincoln [ctb]
Maintainer: Gábor Csárdi <csardi.gabor@gmail.com>
Repository: CRAN
Date/Publication: 2020-05-01 06:50:02 UTC
rematch2/tests/0000755000176200001440000000000013225166704013127 5ustar liggesusers
rematch2/tests/testthat/0000755000176200001440000000000013652743232014770 5ustar liggesusers
rematch2/tests/testthat/test-all.R0000644000176200001440000000276613225166704016652 0ustar liggesusers
context("re_match_all")

test_that("corner cases", {

  res <- re_match_all(.text <- c("foo", "bar"), "")
  expect_equal(
    as.data.frame(res),
    asdf(.text = .text, .match = list(c("", "", ""), c("", "", "")))
  )

  res <- re_match_all(.text <- c("", "bar"), "")
  expect_equal(
    res,
    df(.text = .text, .match = list("", c("", "", "")))
  )

  res <- re_match_all(.text <- character(), "")
  expect_equal(res, df(.text = .text, .match = list()))

  res <- re_match_all(.text <-
character(), "foo")
  expect_equal(as.data.frame(res), asdf(.text = .text, .match = list()))

  res <- re_match_all(.text <- "not", "foo")
  expect_equal(
    as.data.frame(res),
    asdf(.text = .text, .match = list(character()))
  )
})

test_that("capture groups", {

  pattern <- "([0-9]+)"

  res <- re_match_all(
    .text <- c("123xxxx456", "", "xxx", "1", "123"),
    pattern
  )
  expect_equal(
    as.data.frame(res),
    asdf(
      list(c("123", "456"), character(), character(), "1", "123"),
      .text = .text,
      .match = list(c("123", "456"), character(), character(), "1", "123")
    )
  )

})

test_that("scalar text with capture groups", {

  res <- re_match_all(.text <- "foo bar", "\\b(\\w+)\\b")
  expect_equal(
    res,
    df(list(c("foo", "bar")), .text = .text, .match = list(c("foo", "bar")))
  )

  res <- re_match_all(.text <- "foo bar", "\\b(?<word>\\w+)\\b")
  expect_equal(
    res,
    df(
      word = list(c("foo", "bar")),
      .text = .text,
      .match = list(c("foo", "bar"))
    )
  )
})
rematch2/tests/testthat/test-bind.R0000644000176200001440000000154513225166704017010 0ustar liggesusers
context("bind_re_match")

test_that("normal cases", {
  match_cars <- tibble::rownames_to_column(mtcars)
  match_cars_nse <- bind_re_match(match_cars, rowname, "^(?<make>\\w+) ?(?<model>.+)?$")
  match_cars_se <- bind_re_match_(match_cars, "rowname", "^(?<make>\\w+) ?(?<model>.+)?$")
  match_cars_nse_with_match <- bind_re_match(match_cars, rowname, "^(?<make>\\w+) ?(?<model>.+)?$", keep_match = TRUE)
  second_match_cars <- bind_re_match(match_cars_nse_with_match, model, "(?<number>\\d+)", keep_match = TRUE)

  expect_equal(c(names(match_cars), "make", "model"), names(match_cars_nse))
  expect_equal(c(names(match_cars), "make", "model"), names(match_cars_se))
  expect_equal(c(names(match_cars), "make", "model", ".match"), names(match_cars_nse_with_match))
  expect_equal(c(names(match_cars_nse_with_match), "number", ".match1"), names(second_match_cars))
})
rematch2/tests/testthat/test-exec-all.R0000644000176200001440000000461513652522166017571 0ustar liggesusers
test_that("corner cases", {

  res <- re_exec_all_val(.text <- c("foo", "bar"), "")
  expect_equal(
    as.data.frame(res),
    asdf(
      .text = .text,
      .match = allreclist(
        list(
          match = c("", "", ""),
          start = c(1L, 2L, 3L),
          end = c(0L, 1L, 2L)
        ),
        list(
          match = c("", "", ""),
          start = c(1L, 2L, 3L),
          end = c(0L, 1L, 2L)
        )
      )
    )
  )

  res <- re_exec_all_val(.text <- c("", "bar"), "")
  expect_equal(
    as.data.frame(res),
    asdf(
      .text = .text,
      .match = allreclist(
        list(
          match = "",
          start = 1L,
          end = 0L
        ),
        list(
          match = c("", "", ""),
          start = c(1L, 2L, 3L),
          end = c(0L, 1L, 2L)
        )
      )
    )
  )

  res <- re_exec_all_val(.text <- character(), "")
  expect_equal(as.data.frame(res), asdf(.text = .text, .match = allreclist()))

  res <- re_exec_all_val(.text <- character(), "foo")
  expect_equal(as.data.frame(res), asdf(.text = .text, .match = allreclist()))

  res <- re_exec_all_val(.text <- "not", "foo")
  expect_equal(
    as.data.frame(res),
    asdf(
      .text = .text,
      .match = allreclist(mrec(character(), integer(), integer()))
    )
  )
})

test_that("capture groups", {

  pattern <- "([0-9]+)"

  res <- re_exec_all_val(
    .text <- c("123xxxx456", "", "xxx", "1", "123"),
    pattern
  )
  expect_equal(
    as.data.frame(res),
    asdf(
      allreclist(
        mrec(c("123", "456"), c(1L, 8L), c(3L, 10L)),
        norec(),
        norec(),
        mrec("1", 1L, 1L),
        mrec("123", 1, 3)
      ),
      .text = .text,
      .match = allreclist(
        mrec(c("123", "456"), c(1L, 8L), c(3L, 10L)),
        norec(),
        norec(),
        mrec("1", 1L, 1L),
        mrec("123", 1, 3)
      )
    )
  )

})

test_that("scalar text with capture groups", {

  res <- re_exec_all_val(.text <- "foo bar", "\\b(\\w+)\\b")
  expect_equal(
    as.data.frame(res),
    asdf(
      allreclist(mrec(c("foo", "bar"), c(1L, 5L), c(3L, 7L))),
      .text = .text,
      .match = allreclist(mrec(c("foo", "bar"), c(1L, 5L), c(3L, 7L)))
    )
  )

  res <- re_exec_all_val(.text <- "foo bar", "\\b(?<word>\\w+)\\b")
  expect_equal(
    as.data.frame(res),
    asdf(
      word = allreclist(mrec(c("foo", "bar"), c(1L, 5L), c(3L, 7L))),
      .text = .text,
      .match = allreclist(mrec(c("foo", "bar"), c(1L, 5L), c(3L, 7L)))
    )
  )
})
rematch2/tests/testthat/test-exec.R0000644000176200001440000001056613652522166017025 0ustar liggesusers
test_that("corner 
cases", {

  res <- re_exec_val(.text <- c("foo", "bar"), "")
  expect_equal(
    as.data.frame(res),
    asdf(.text = .text, .match = reclist(mrec("", 1, 0), mrec("", 1, 0)))
  )

  res <- re_exec_val(.text <- c("foo", "", "bar"), "")
  expect_equal(
    as.data.frame(res),
    asdf(
      .text = .text,
      .match = reclist(mrec("", 1, 0), mrec("", 1, 0), mrec("", 1, 0))
    )
  )

  res <- re_exec_val(.text <- character(), "")
  expect_equal(as.data.frame(res), asdf(.text = .text, .match = reclist()))

  res <- re_exec_val(.text <- character(), "foo")
  expect_equal(as.data.frame(res), asdf(.text = .text, .match = reclist()))

  res <- re_exec_val(.text <- character(), "foo (g1) (g2)")
  expect_equal(
    as.data.frame(res),
    asdf(reclist(), reclist(), .text = .text, .match = reclist())
  )

  res <- re_exec_val(.text <- character(), "foo (g1) (?<name>g2)")
  expect_equal(
    as.data.frame(res),
    asdf(reclist(), name = reclist(), .text = .text, .match = reclist())
  )

  res <- re_exec_val(.text <- "not", "foo")
  expect_equal(
    as.data.frame(res),
    asdf(.text = .text, .match = reclist(mrec(NA, NA, NA)))
  )
})

test_that("not so corner cases", {

  dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
             "76-03-02", "2012-06-30", "2015-01-21 19:58")
  isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
  expected <- asdf(
    reclist(
      mrec("2016", 1, 4),
      mrec("1977", 1, 4),
      narec(), narec(), narec(),
      mrec("2012", 1, 4),
      mrec("2015", 1, 4)
    ),
    reclist(
      mrec("04", 6, 7),
      mrec("08", 6, 7),
      narec(), narec(), narec(),
      mrec("06", 6, 7),
      mrec("01", 6, 7)
    ),
    reclist(
      mrec("20", 9, 10),
      mrec("08", 9, 10),
      narec(), narec(), narec(),
      mrec("30", 9, 10),
      mrec("21", 9, 10)
    ),
    .text = dates,
    .match = reclist(
      mrec("2016-04-20", 1, 10),
      mrec("1977-08-08", 1, 10),
      narec(), narec(), narec(),
      mrec("2012-06-30", 1, 10),
      mrec("2015-01-21", 1, 10)
    )
  )
  expect_equal(
    as.data.frame(re_exec_val(text = dates, pattern = isodate)),
    expected
  )

  isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
  expected <- asdf(
    year = reclist(
      mrec("2016", 1, 4),
      mrec("1977", 1, 4),
      narec(), narec(),
      narec(),
      mrec("2012", 1, 4),
      mrec("2015", 1, 4)
    ),
    month = reclist(
      mrec("04", 6, 7),
      mrec("08", 6, 7),
      narec(), narec(), narec(),
      mrec("06", 6, 7),
      mrec("01", 6, 7)
    ),
    day = reclist(
      mrec("20", 9, 10),
      mrec("08", 9, 10),
      narec(), narec(), narec(),
      mrec("30", 9, 10),
      mrec("21", 9, 10)
    ),
    .text = dates,
    .match = reclist(
      mrec("2016-04-20", 1, 10),
      mrec("1977-08-08", 1, 10),
      narec(), narec(), narec(),
      mrec("2012-06-30", 1, 10),
      mrec("2015-01-21", 1, 10)
    )
  )
  expect_equal(
    as.data.frame(re_exec_val(text = dates, pattern = isodaten)),
    expected
  )
})

test_that("UTF8", {

  str <- "Gábor Csárdi"
  pat <- "Gábor"
  Encoding(str) <- Encoding(pat) <- "UTF-8"
  res <- re_exec_val(str, pat)
  expect_equal(
    as.data.frame(res),
    asdf(.text = str, .match = reclist(mrec(pat, 1, 5)))
  )
})

test_that("text is scalar & capture groups", {

  res <- re_exec_val(.text <- "foo bar", "(\\w+) (\\w+)")
  expect_equal(
    as.data.frame(res),
    asdf(
      reclist(mrec("foo", 1, 3)),
      reclist(mrec("bar", 5, 7)),
      .text = .text,
      .match = reclist(mrec("foo bar", 1, 7))
    )
  )

  res <- re_exec_val(.text <- "foo bar", "(?<g1>\\w+) (?<g2>\\w+)")
  expect_equal(
    as.data.frame(res),
    asdf(
      g1 = reclist(mrec("foo", 1, 3)),
      g2 = reclist(mrec("bar", 5, 7)),
      .text = .text,
      .match = reclist(mrec("foo bar", 1, 7))
    )
  )
})

test_that("perl argument", {

  # using perl=TRUE used to cause an error; not important in this case, but must
  # be supported if we want this to be a drop in replacement for other functions
  # (e.g.
  # re-implementing `strsplit` with a rematch2 backend)

  res <- re_exec_val(.text <- "foo bar", "\\w+", perl = TRUE)
  expect_equal(
    as.data.frame(res),
    asdf(
      .text = .text,
      .match = reclist(mrec("foo", 1, 3))
    )
  )

  # actually check that the capture group doesn't show up

  res.tre <- re_exec_val(.text <- "foo bar", "\\w+ (\\w+)", perl = FALSE)
  res.perl <- re_exec_val(.text <- "foo bar", "\\w+ (\\w+)", perl = TRUE)

  expect_true(ncol(as.data.frame(res.perl)) == 3 && ncol(res.tre) == 2)
})
rematch2/tests/testthat/helper.R0000644000176200001440000000151213652522166016372 0ustar liggesusers
df <- function(...) {
  args <- list(...)
  structure(
    args,
    names = names(args),
    row.names = seq_along(args[[1]]),
    class = c("tbl_df", "tbl", "data.frame")
  )
}

asdf <- function(...) {
  as.data.frame(df(...))
}

mrec <- function(match, start, end) {
  list(
    match = as.character(match),
    start = as.integer(start),
    end = as.integer(end)
  )
}

narec <- function() {
  mrec(NA, NA, NA)
}

norec <- function() {
  mrec(character(), integer(), integer())
}

reclist <- function(...) {
  new_rematch_records(list(...))
}

allreclist <- function(...) {
  new_rematch_allrecords(list(...))
}

re_exec_val <- function(...) {
  res <- re_exec(...)
  expect_silent(tibble::validate_tibble(res))
  res
}

re_exec_all_val <- function(...) {
  res <- re_exec_all(...)
  expect_silent(tibble::validate_tibble(res))
  res
}
rematch2/tests/testthat/test-indexing.R0000644000176200001440000000456413652522166017707 0ustar liggesusers
test_that("re_exec_val indexing", {

  res <- re_exec_val(character(), "foo([0-9]+)")
  expect_identical(res[[1]]$match, character())
  expect_identical(res[[1]]$start, integer())
  expect_identical(res[[1]]$end, integer())
  expect_identical(res$.match$match, character())
  expect_identical(res$.match$start, integer())
  expect_identical(res$.match$end, integer())

  name_rex <- paste0(
    "(?<first>[[:upper:]][[:lower:]]+) ",
    "(?<last>[[:upper:]][[:lower:]]+)"
  )
  notables <- c(
    "  Ben Franklin and Jefferson Davis",
    "\tMillard Fillmore"
  )

  pos <- re_exec_val(notables, name_rex)

  expect_identical(pos$first$match, c("Ben", "Millard"))
  expect_identical(pos$first$start, c(3L, 2L))
  expect_identical(pos$first$end, c(5L, 8L))

  expect_identical(pos$last$match, c("Franklin", "Fillmore"))
  expect_identical(pos$last$start, c(7L, 10L))
  expect_identical(pos$last$end, c(14L, 17L))

  expect_identical(pos$.match$match, c("Ben Franklin", "Millard Fillmore"))
  expect_identical(pos$.match$start, c(3L, 2L))
  expect_identical(pos$.match$end, c(14L, 17L))
})

test_that("re_exec_all_val indexing", {

  name_rex <- paste0(
    "(?<first>[[:upper:]][[:lower:]]+) ",
    "(?<last>[[:upper:]][[:lower:]]+)"
  )
  notables <- c(
    "  Ben Franklin and Jefferson Davis",
    "\tMillard Fillmore"
  )

  allpos <- re_exec_all_val(notables, name_rex)

  expect_identical(
    allpos$first$match,
    list(c("Ben", "Jefferson"), "Millard")
  )
  expect_identical(allpos$first$start, list(c(3L, 20L), 2L))
  expect_identical(allpos$first$end, list(c(5L, 28L), 8L))

  expect_identical(
    allpos$last$match,
    list(c("Franklin", "Davis"), "Fillmore")
  )
  expect_identical(allpos$last$start, list(c(7L, 30L), 10L))
  expect_identical(allpos$last$end, list(c(14L, 34L), 17L))

  expect_identical(
    allpos$.match$match,
    list(c("Ben Franklin", "Jefferson Davis"), "Millard Fillmore")
  )
  expect_identical(allpos$.match$start, list(c(3L, 20L), 2L))
expect_identical(allpos$.match$end, list(c(14L, 34L), 17L)) }) test_that("$ errors", { name_rex <- paste0( "(?<first>[[:upper:]][[:lower:]]+) ", "(?<last>[[:upper:]][[:lower:]]+)" ) notables <- c( " Ben Franklin and Jefferson Davis", "\tMillard Fillmore" ) pos <- re_exec_val(notables, name_rex) allpos <- re_exec_all_val(notables, name_rex) expect_error(pos$first$foo) expect_error(allpos$first$foo) }) rematch2/tests/testthat/test.R0000644000176200001440000000453213225166704016075 0ustar liggesusers context("re_match") test_that("corner cases", { res <- re_match(.text <- c("foo", "bar"), "") expect_equal(res, df(.text = .text, .match = c("", ""))) res <- re_match(.text <- c("foo", "", "bar"), "") expect_equal(res, df(.text = .text, .match = c("", "", ""))) res <- re_match(.text <- character(), "") expect_equal(res, df(.text = .text, .match = character())) res <- re_match(.text <- character(), "foo") expect_equal(res, df(.text = .text, .match = character())) res <- re_match(.text <- character(), "foo (g1) (g2)") expect_equal( res, df(character(), character(), .text = .text, .match = character()) ) res <- re_match(.text <- character(), "foo (g1) (?<name>g2)") expect_equal( res, df(character(), name = character(), .text = .text, .match = character()) ) res <- re_match(.text <- "not", "foo") expect_equal(res, df(.text = .text, .match = NA_character_)) }) test_that("not so corner cases", { dates <- c("2016-04-20", "1977-08-08", "not a date", "2016", "76-03-02", "2012-06-30", "2015-01-21 19:58") isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])" expect_equal( as.data.frame(re_match(text = dates, pattern = isodate)), asdf( c("2016", "1977", NA, NA, NA, "2012", "2015"), c("04", "08", NA, NA, NA, "06", "01"), c("20", "08", NA, NA, NA, "30", "21"), .text = dates, .match = c(dates[1:2], NA, NA, NA, "2012-06-30", "2015-01-21") ) ) isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])" expect_equal( re_match(text = dates, pattern = isodaten), df( year = c("2016", "1977", NA, NA, NA, "2012", "2015"), 
month = c("04", "08", NA, NA, NA, "06", "01"), day = c("20", "08", NA, NA, NA, "30", "21"), .text = dates, .match = c(dates[1:2], NA, NA, NA, "2012-06-30", "2015-01-21") ) ) }) test_that("UTF8", { res <- re_match(.text <- "Gábor Csárdi", "Gábor") expect_equal(res, df(.text = .text, .match = "Gábor")) }) test_that("text is scalar & capture groups", { res <- re_match(.text <- "foo bar", "(\\w+) (\\w+)") expect_equal( as.data.frame(res), asdf("foo", "bar", .text = .text, .match = "foo bar") ) res <- re_match(.text <- "foo bar", "(?<g1>\\w+) (?<g2>\\w+)") expect_equal( res, df(g1 = "foo", g2 = "bar", .text = .text, .match = "foo bar") ) }) rematch2/tests/testthat.R0000644000176200001440000000011113225166704015103 0ustar liggesusers if (require(testthat)) { library(rematch2) test_check("rematch2") } rematch2/R/0000755000176200001440000000000013652522166012170 5ustar liggesusersrematch2/R/exec.R0000644000176200001440000001016713652522166013244 0ustar liggesusers#' Extract Data From First Regular Expression Match Into a Data Frame #' #' @description #' #' Match a regular expression to a string, and return matches, match positions, #' and capture groups. This function is like its #' \code{\link[=re_match]{match}} counterpart, except it returns match/capture #' group start and end positions in addition to the matched values. #' #' @section Tidy Data: #' #' The return value is a tidy data frame where each row #' corresponds to an element of the input character vector \code{text}. The #' values from \code{text} appear for reference in the \code{.text} character #' column. All other columns are list columns containing the match data. The #' \code{.match} column contains the match information for full regular #' expression matches while other columns correspond to capture groups if there #' are any, and PCRE matches are enabled with \code{perl = TRUE} (this is on by #' default). If capture groups are named the corresponding columns will bear #' those names. 
#' #' Each match data column list contains match records, one for each element in #' \code{text}. A match record is a named list, with entries \code{match}, #' \code{start} and \code{end} that are respectively the matching (sub) string, #' the start, and the end positions (using one-based indexing). #' #' @section Extracting Match Data: #' #' To make it easier to extract matching substrings or positions, a special #' \code{$} operator is defined on match columns, both for the \code{.match} #' column and the columns corresponding to the capture groups. See examples #' below. #' #' @inheritParams re_match_all #' @seealso \code{\link[base]{regexpr}}, which this function wraps #' @param x Object returned by \code{re_exec} or \code{re_exec_all}. #' @param name \code{match}, \code{start} or \code{end}. #' @return A tidy data frame (see Section \dQuote{Tidy Data}). Match record #' entries are length-one vectors that are set to NA if there is no match. #' @family tidy regular expression matching #' @export #' @examples #' name_rex <- paste0( #' "(?<first>[[:upper:]][[:lower:]]+) ", #' "(?<last>[[:upper:]][[:lower:]]+)" #' ) #' notables <- c( #' " Ben Franklin and Jefferson Davis", #' "\tMillard Fillmore" #' ) #' # Match first occurrence #' pos <- re_exec(notables, name_rex) #' pos #' #' # Custom $ to extract matches and positions #' pos$first$match #' pos$first$start #' pos$first$end re_exec <- function(text, pattern, perl=TRUE, ...) { stopifnot(is.character(pattern), length(pattern) == 1, !is.na(pattern)) text <- as.character(text) match <- regexpr(pattern, text, perl = perl, ...) 
start <- as.vector(match) length <- attr(match, "match.length") end <- start + length - 1L matchstr <- substring(text, start, end) matchstr[ start == -1 ] <- NA_character_ end [ start == -1 ] <- NA_integer_ start [ start == -1 ] <- NA_integer_ names <- c("match", "start", "end") matchlist <- new_rematch_records( lapply(seq_along(text), function(i) { structure(list(matchstr[i], start[i], end[i]), names = names) }) ) res <- new_tibble( list(text, matchlist), names = c(".text", ".match"), nrow = length(text) ) if (!is.null(attr(match, "capture.start"))) { gstart <- unname(attr(match, "capture.start")) glength <- unname(attr(match, "capture.length")) gend <- gstart + glength - 1L groupstr <- substring(text, gstart, gend) groupstr[ gstart == -1 ] <- NA_character_ gend [ gstart == -1 ] <- NA_integer_ gstart [ gstart == -1 ] <- NA_integer_ dim(groupstr) <- dim(gstart) grouplists <- lapply( seq_along(attr(match, "capture.names")), function(g) { new_rematch_records( lapply(seq_along(text), function(i) { structure( list(groupstr[i, g], gstart[i, g], gend[i, g]), names = names ) }) ) } ) res <- new_tibble( c(grouplists, res), names = c(attr(match, "capture.names"), ".text", ".match"), nrow = length(res[[1]]) ) } res } new_rematch_records <- function(x) { structure(x, class = c("rematch_records", "list")) } rematch2/R/all.R0000644000176200001440000000630613225166704013066 0ustar liggesusers #' Extract All Regular Expression Matches Into a Data Frame #' #' This function is a thin wrapper on the \code{\link[base]{gregexpr}} #' base R function, to extract the matching (sub)strings as a data frame. #' It extracts all matches, and potentially their capture groups as well. #' #' @inheritSection re_exec Tidy Data #' #' @note If the input text character vector has length zero, #' \code{\link[base]{regexpr}} is called instead of #' \code{\link[base]{gregexpr}}, because the latter cannot extract the #' number and names of the capture groups in this case. #' #' @param ... 
Additional arguments to pass to #' \code{\link[base]{gregexpr}} (or \code{\link[base]{regexpr}} if #' \code{text} is of length zero). #' @inheritParams re_match #' @return A tidy data frame (see Section \dQuote{Tidy Data}). The list columns #' contain character vectors with as many entries as there are matches for #' each input element. #' #' @family tidy regular expression matching #' @export #' @examples #' name_rex <- paste0( #' "(?<first>[[:upper:]][[:lower:]]+) ", #' "(?<last>[[:upper:]][[:lower:]]+)" #' ) #' notables <- c( #' " Ben Franklin and Jefferson Davis", #' "\tMillard Fillmore" #' ) #' re_match_all(notables, name_rex) re_match_all <- function(text, pattern, perl=TRUE, ...) { text <- as.character(text) stopifnot(is.character(pattern), length(pattern) == 1, !is.na(pattern)) ## Need to handle this case separately, as gregexpr effectively ## does not work for this. if (length(text) == 0) return(empty_result(text, pattern, perl=perl, ...)) match <- gregexpr(pattern, text, perl=perl, ...) num_groups <- length(attr(match[[1]], "capture.names")) ## Non-matching strings have a rather strange special form, ## so we just treat them differently non <- vapply(match, function(m) m[1] == -1, TRUE) yes <- !non res <- replicate(length(text), list(), simplify = FALSE) if (any(non)) { res[non] <- list(replicate(num_groups + 1, character(), simplify = FALSE)) } if (any(yes)) { res[yes] <- mapply(match1, text[yes], match[yes], SIMPLIFY = FALSE) } ## Need to assemble the final data frame "manually". ## There is apparently no function for this. rbind() is almost ## good, but simplifies to a matrix if the dimensions allow it.... 
res <- lapply(seq_along(res[[1]]), function(i) { lapply(res, "[[", i) }) res <- structure( res, names = c(attr(match[[1]], "capture.names"), ".match"), row.names = seq_along(text), class = c("tbl_df", "tbl", "data.frame") ) res$.text <- text nc <- ncol(res) res[, c(seq_len(nc - 2), nc, nc - 1)] } match1 <- function(text1, match1) { matchstr <- substring( text1, match1, match1 + attr(match1, "match.length") - 1L ) ## substring fails if the index is length zero, ## need to handle special case if (is.null(attr(match1, "capture.start"))) { list(.match = matchstr) } else { gstart <- attr(match1, "capture.start") glength <- attr(match1, "capture.length") gend <- gstart + glength - 1L groupstr <- substring(text1, gstart, gend) dim(groupstr) <- dim(gstart) c(lapply(seq_len(ncol(groupstr)), function(i) groupstr[, i]), list(.match = matchstr) ) } } rematch2/R/empty_result.R0000644000176200001440000000066113225166704015050 0ustar liggesusers empty_result <- function(text, pattern, perl=TRUE, ...) { match <- regexpr(pattern, text, perl = perl, ...) num_groups <- length(attr(match, "capture.names")) structure( c( replicate(num_groups, list(), simplify = FALSE), list(character()), list(list()) ), names = c(attr(match, "capture.names"), ".text", ".match"), row.names = integer(0), class = c("tbl_df", "tbl", "data.frame") ) } rematch2/R/indexing.R0000644000176200001440000000105713225166704014121 0ustar liggesusers #' @rdname re_exec #' @export $.rematch_records #' @export `$.rematch_records` <- function(x, name) { if (! name %in% c("match", "start", "end")) { stop("'$' match selector must refer to 'match', 'start' or 'end'") } vapply(x, "[[", name, FUN.VALUE = if (name == "match") "" else 1L) } #' @rdname re_exec #' @export $.rematch_allrecords #' @export `$.rematch_allrecords` <- function(x, name) { if (! 
name %in% c("match", "start", "end")) { stop("'$' match selector must refer to 'match', 'start' or 'end'") } lapply(x, "[[", name) } rematch2/R/bind_re_match.R0000644000176200001440000000367313637633662015111 0ustar liggesusers#' Match results from a data frame column and attach results #' #' Taking a data frame and a column name as input, this function will run #' \code{\link{re_match}} and bind the results as new columns to the original #' table, returning a \code{\link[tibble]{tibble}}. This makes it friendly for #' pipe-oriented programming with \link[magrittr]{magrittr}. #' #' @note If named capture groups will result in multiple columns with the same #' column name, \code{\link[tibble]{repair_names}} will be called on the #' resulting table. #' #' @param df A data frame. #' @param from Name of column to use as input for \code{\link{re_match}}. #' \code{\link{bind_re_match}} takes unquoted names, while #' \code{\link{bind_re_match_}} takes quoted names. #' @param ... Arguments (including \code{pattern}) to pass to #' \code{\link{re_match}}. #' @param keep_match Should the column \code{.match} be included in the results? #' Defaults to \code{FALSE}, to avoid column name collisions in the case that #' \code{\link{bind_re_match}} is called multiple times in succession. #' #' @seealso Standard-evaluation version \code{\link{bind_re_match_}} that is #' suitable for programming. #' #' @examples #' match_cars <- tibble::rownames_to_column(mtcars) #' bind_re_match(match_cars, rowname, "^(?\\w+) ?(?.+)?$") #' #' @export bind_re_match <- function(df, from, ..., keep_match = FALSE) { bind_re_match_(df = df, from = deparse(substitute(from)), ..., keep_match = keep_match) } #' @describeIn bind_re_match Standard-evaluation version that takes a quoted column name. 
#' @export bind_re_match_ <- function(df, from, ..., keep_match = FALSE) { stopifnot(is.data.frame(df)) if (!tibble::has_name(df, from)) stop(from, " is not present in the data frame.") res <- re_match(text = df[[from]], ...) res <- res[, !names(res) == ".text"] if (!keep_match) { res <- res[, !names(res) == ".match"] } tibble::repair_names(cbind(df, res)) } rematch2/R/package.R0000644000176200001440000000567713652522166013725 0ustar liggesusers #' Match Regular Expressions with a Nicer 'API' #' #' A small wrapper on 'regexpr' to extract the matches and captured #' groups from the match of a regular expression to a character vector. #' See \code{\link{re_match}}. #' #' @importFrom tibble tibble new_tibble "_PACKAGE" #' Extract Regular Expression Matches Into a Data Frame #' #' \code{re_match} wraps \code{\link[base]{regexpr}} and returns the #' match results in a convenient data frame. The data frame has one #' column for each capture group if \code{perl=TRUE}, and one final column #' called \code{.match} for the matching (sub)string. The columns of the capture #' groups are named if the groups themselves are named. #' #' @note \code{re_match} uses PCRE compatible regular expressions by default #' (i.e. \code{perl = TRUE} in \code{\link[base]{regexpr}}). You can switch #' this off but if you do so capture groups will no longer be reported as they #' are only supported by PCRE. #' #' @param text Character vector. #' @param pattern A regular expression. See \code{\link[base]{regex}} for more #' about regular expressions. #' @param perl Logical; should Perl-compatible regular expressions be used? #' Defaults to TRUE; setting it to FALSE will disable capture groups. #' @param ... Additional arguments to pass to \code{\link[base]{regexpr}}. #' @return A data frame of character vectors: one column per capture #' group, named if the group was named, and additional columns for #' the input text and the first matching (sub)string. 
Each row #' corresponds to an element in the \code{text} vector. #' #' @export #' @family tidy regular expression matching #' @examples #' dates <- c("2016-04-20", "1977-08-08", "not a date", "2016", #' "76-03-02", "2012-06-30", "2015-01-21 19:58") #' isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])" #' re_match(text = dates, pattern = isodate) #' #' # The same with named groups #' isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])" #' re_match(text = dates, pattern = isodaten) re_match <- function(text, pattern, perl = TRUE, ...) { stopifnot(is.character(pattern), length(pattern) == 1, !is.na(pattern)) text <- as.character(text) match <- regexpr(pattern, text, perl = perl, ...) start <- as.vector(match) length <- attr(match, "match.length") end <- start + length - 1L matchstr <- substring(text, start, end) matchstr[ start == -1 ] <- NA_character_ res <- data.frame( stringsAsFactors = FALSE, .text = text, .match = matchstr ) if (!is.null(attr(match, "capture.start"))) { gstart <- attr(match, "capture.start") glength <- attr(match, "capture.length") gend <- gstart + glength - 1L groupstr <- substring(text, gstart, gend) groupstr[ gstart == -1 ] <- NA_character_ dim(groupstr) <- dim(gstart) res <- cbind(groupstr, res, stringsAsFactors = FALSE) } names(res) <- c(attr(match, "capture.names"), ".text", ".match") class(res) <- c("tbl_df", "tbl", class(res)) res } rematch2/R/exec-all.R0000644000176200001440000000707713637633664014031 0ustar liggesusers#' Extract Data From All Regular Expression Matches Into a Data Frame #' #' @inherit re_exec #' #' @description #' #' Match a regular expression to a string, and return matches, match positions, #' and capture groups. This function is like its #' \code{\link[=re_match_all]{match}} counterpart, except it returns #' match/capture group start and end positions in addition to the matched #' values. 
#' #' @seealso \code{\link[base]{gregexpr}}, which this function wraps #' @return A tidy data frame (see Section \dQuote{Tidy Data}). The entries #' within the match records within the list columns will be vectors with #' as many elements as there are matches for the corresponding text element. #' @family tidy regular expression matching #' @export #' @examples #' name_rex <- paste0( #' "(?<first>[[:upper:]][[:lower:]]+) ", #' "(?<last>[[:upper:]][[:lower:]]+)" #' ) #' notables <- c( #' " Ben Franklin and Jefferson Davis", #' "\tMillard Fillmore" #' ) #' # All occurrences #' allpos <- re_exec_all(notables, name_rex) #' allpos #' #' # Custom $ to extract matches and positions #' allpos$first$match #' allpos$first$start #' allpos$first$end re_exec_all <- function(text, pattern, perl = TRUE, ...) { text <- as.character(text) stopifnot(is.character(pattern), length(pattern) == 1, !is.na(pattern)) if (length(text) == 0) { res <- empty_result(text, pattern, perl = perl, ...) for (i in seq_along(res)) { if (is.list(res[[i]])) { res[[i]] <- new_rematch_allrecords(res[[i]]) } } return(res) } match <- gregexpr(pattern, text, perl = perl, ...) 
rec_names <- c("match", "start", "end") colnames <- c(attr(match[[1]], "capture.names"), ".match") num_groups <- length(colnames) - 1L non_rec <- structure( list(character(0), integer(0), integer(0)), names = rec_names ) ## Non-matching strings have a rather strange special form, ## so we just treat them differently non <- vapply(match, function(m) m[1] == -1, TRUE) yes <- !non res <- replicate(length(text), list(), simplify = FALSE) if (any(non)) { res[non] <- list(replicate(num_groups + 1, non_rec, simplify = FALSE)) } if (any(yes)) { res[yes] <- mapply(exec1, text[yes], match[yes], SIMPLIFY = FALSE) } res <- lapply(seq_along(res[[1]]), function(i) { new_rematch_allrecords(lapply(res, "[[", i)) }) res <- structure( res, names = colnames, row.names = seq_along(text), class = c("tbl_df", "tbl", "data.frame") ) res$.text <- text nc <- ncol(res) res[, c(seq_len(nc - 2), nc, nc - 1)] } exec1 <- function(text1, match1) { start <- as.vector(match1) length <- attr(match1, "match.length") end <- start + length - 1L matchstr <- substring(text1, start, end) matchrec <- list(match = matchstr, start = start, end = end) colnames <- c(attr(match1, "capture.names"), ".match") ## substring fails if the index is length zero, ## need to handle special case res <- if (is.null(attr(match1, "capture.start"))) { replicate(length(colnames), matchrec, simplify = FALSE) } else { gstart <- unname(attr(match1, "capture.start")) glength <- unname(attr(match1, "capture.length")) gend <- gstart + glength - 1L groupstr <- substring(text1, gstart, gend) dim(groupstr) <- dim(gstart) c( lapply( seq_len(ncol(groupstr)), function(i) { list(match = groupstr[, i], start = gstart[, i], end = gend[, i]) } ), list(.match = matchrec) ) } res } new_rematch_allrecords <- function(x) { structure(x, class = c("rematch_allrecords", "list")) } rematch2/NEWS.md0000644000176200001440000000147413652524252013071 0ustar liggesusers # 2.1.2 * rematch2 is now really compatible with both tibble 2.x.y and tibble 3.0.0 
(@krlmlr, #12). # 2.1.1 * rematch2 is now compatible with both tibble 2.x.y and tibble 3.0.0 (@krlmlr, #10). # 2.1.0 * Add `bind_re_match()` that reads its input from a column in a data frame and binds the data frame returned by `re_match()` as new columns on the original data frame. # 2.0.1 * Add `perl` argument to `re_match` and `re_match_all` for compatibility with functions that may pass that argument as part of `...` # 2.0.0 * Add `re_match_all` to extract all matches. * Removed the `perl` options, we always use PERL compatible regular expressions. # 1.0.1 * Make `R CMD check` work when `testthat` is not available. * Fixed a bug with group capture when `text` is a scalar. # 1.0.0 First public release. rematch2/MD50000644000176200001440000000247613652743232012307 0ustar liggesusers32c2ecb21ce2d6fdddd1ec9ca61bb299 *DESCRIPTION cb22cf9e08c2ac2d8fa86d6348545e18 *LICENSE e6c8f468697c0625d8c56fe03b7cf1a4 *NAMESPACE 05147e0cdaafc7b9cd55241c39c918e6 *NEWS.md bac091392b33edc38cbcfc4b8dcf9dd8 *R/all.R b73f771d77fee24c64e0e2af6c56a087 *R/bind_re_match.R 8c4723a39e96cdada833c26930af3d52 *R/empty_result.R 31101d2fa3cabf24de0d882d30bc0312 *R/exec-all.R 1ecd692cd0b231f4eca4615ed2972bca *R/exec.R 9ab225156360b0b56ff21d43ff4dcd81 *R/indexing.R c895533f5d6d2b75bf62ab0738a51532 *R/package.R 88b2e652d9d1174b39b446cfff03cf06 *README.md 2db7208e052d885bb89f798945e35a8c *man/bind_re_match.Rd a1a51ce06dfcc7490be66c371c4e8747 *man/re_exec.Rd fb404f8fa9a5eb95ba67d1a549d7e52d *man/re_exec_all.Rd ce6fe351c6b8a4271d379a1b1f88c0eb *man/re_match.Rd 2c4f90655ed88ebc39dfdb8aad1750a4 *man/re_match_all.Rd 29b07ddcaf71c6b436d42b37344a38d0 *man/rematch2-package.Rd 593f32c5e3fb4876c81f4127c5354954 *tests/testthat.R 190da1e3b3f75b71f99de08e1b598810 *tests/testthat/helper.R c460f1143d11c904a47ed1b80d2075c5 *tests/testthat/test-all.R 84f04eaae56fead65f0d48f1d61fb1c0 *tests/testthat/test-bind.R ff6194fc4f5ebf64b47ec8b25cccbb90 *tests/testthat/test-exec-all.R adbe50c5292e8d51da0a937e0e3a5037 
*tests/testthat/test-exec.R 3a382258f042452731bee39c151f0ca0 *tests/testthat/test-indexing.R c5a306d4ec0f3bb85d0b2e34d5f49c44 *tests/testthat/test.R