recipes/ 0000755 0001777 0001777 00000000000 13136342173 013241 5 ustar herbrandt herbrandt recipes/inst/ 0000755 0001777 0001777 00000000000 13136242227 014215 5 ustar herbrandt herbrandt recipes/inst/doc/ 0000755 0001777 0001777 00000000000 13136242227 014762 5 ustar herbrandt herbrandt recipes/inst/doc/Selecting_Variables.R 0000644 0001777 0001777 00000002054 13136242227 021013 0 ustar herbrandt herbrandt ## ----ex_setup, include=FALSE--------------------------------------------- knitr::opts_chunk$set( message = FALSE, digits = 3, collapse = TRUE, comment = "#>" ) options(digits = 3) ## ----credit-------------------------------------------------------------- library(recipes) data("credit_data") str(credit_data) rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data) rec ## ----var_info_orig------------------------------------------------------- summary(rec, original = TRUE) ## ----dummy_1------------------------------------------------------------- dummied <- rec %>% step_dummy(all_nominal()) ## ----dummy_2------------------------------------------------------------- dummied <- rec %>% step_dummy(Records) # or dummied <- rec %>% step_dummy(all_nominal(), - Status) # or dummied <- rec %>% step_dummy(all_nominal(), - all_outcomes()) ## ----dummy_3------------------------------------------------------------- dummied <- prep(dummied, training = credit_data) with_dummy <- bake(dummied, newdata = credit_data) with_dummy recipes/inst/doc/Custom_Steps.html 0000644 0001777 0001777 00000706214 13136242227 020312 0 ustar herbrandt herbrandt
recipes
contains a number of different steps included in the package:
library(recipes)
steps <- apropos("^step_")
steps[!grepl("new$", steps)]
#> [1] "step_BoxCox" "step_YeoJohnson" "step_bagimpute"
#> [4] "step_bin2factor" "step_center" "step_classdist"
#> [7] "step_corr" "step_date" "step_depth"
#> [10] "step_discretize" "step_dummy" "step_holiday"
#> [13] "step_hyperbolic" "step_ica" "step_interact"
#> [16] "step_intercept" "step_invlogit" "step_isomap"
#> [19] "step_knnimpute" "step_kpca" "step_lincomb"
#> [22] "step_log" "step_logit" "step_meanimpute"
#> [25] "step_modeimpute" "step_ns" "step_nzv"
#> [28] "step_ordinalscore" "step_other" "step_pca"
#> [31] "step_poly" "step_range" "step_ratio"
#> [34] "step_regex" "step_rm" "step_scale"
#> [37] "step_shuffle" "step_spatialsign" "step_sqrt"
#> [40] "step_window"
You might want to make your own and this page describes how to do that. If you are looking for good examples of existing steps, I would suggest looking at the code for centering or PCA to start.
At an example, let’s create a step that replaces the value of a variable with its percentile from the training set. The date that I’ll use is from the recipes
package:
data(biomass)
str(biomass)
#> 'data.frame': 536 obs. of 8 variables:
#> $ sample : chr "Akhrot Shell" "Alabama Oak Wood Waste" "Alder" "Alfalfa" ...
#> $ dataset : chr "Training" "Training" "Training" "Training" ...
#> $ carbon : num 49.8 49.5 47.8 45.1 46.8 ...
#> $ hydrogen: num 5.64 5.7 5.8 4.97 5.4 5.75 5.99 5.7 5.5 5.9 ...
#> $ oxygen : num 42.9 41.3 46.2 35.6 40.7 ...
#> $ nitrogen: num 0.41 0.2 0.11 3.3 1 2.04 2.68 1.7 0.8 1.2 ...
#> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ...
#> $ HHV : num 20 19.2 18.3 18.2 18.4 ...
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]
To illustrate the transformation with the carbon
variable, the training set distribution of that variables is shown below with a vertical line for the first value of the test set.
library(ggplot2)
theme_set(theme_bw())
ggplot(biomass_tr, aes(x = carbon)) +
geom_histogram(binwidth = 5, col = "blue", fill = "blue", alpha = .5) +
geom_vline(xintercept = biomass_te$carbon[1], lty = 2)
Based on the training set, 42.1% of the data are less than a value of 46.35. There are some applications where it might be advantageous to represent the predictor values are percentiles rather than their original values.
Our new step will do this computation for any numeric variables of interest. We will call this step_percentile
. The code below is designed for illustration and not speed or best practices. I’ve left out a lot of error trapping that we would want in a real implementation.
The user-exposed function step_percentile
is just a simple wrapper around an internal function called add_step
. This function takes the same arguments as your function and simply adds it to a new recipe. The ...
signfies the variable selectors that can be used.
step_percentile <- function(recipe, ..., role = NA,
trained = FALSE, ref_dist = NULL,
approx = FALSE,
options = list(probs = (0:100)/100, names = TRUE)) {
## bake but do not evaluate the variable selectors with
## the `quos` function in `rlang`
terms <- rlang::quos(...)
if(length(terms) == 0)
stop("Please supply at least one variable specification. See ?selections.")
add_step(
recipe,
step_percentile_new(
terms = terms,
trained = trained,
role = role,
ref_dist = ref_dist,
approx = approx,
options = options))
}
You should always keep the first four arguments (recipe
though trained
) the same as listed above. Some notes:
role
argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using role = NA
will leave the existing role intact.trained
is set by the package when the estimation step has been run. You should default your function definition’s argument to FALSE
.I’ve added extra arguments specific to this step. In order to calculate the percentile, the training data for the relevant columns will need to be saved. This data will be saved in the ref_dist
object. However, this might be problematic if the data set is large. approx
would be used when you want to save a grid of pre-computed percentiles from the training set and use these to estimate the percentile for a new data point. If approx = TRUE
, the argument ref_dist
will contain the grid for each variable.
We will use the stats::quantile
to compute the grid. However, we might also want to have control over the granularity of this grid, so the options
argument will be used to define how that calculations is done. We could just use the ellipses (aka ...
) so that any options passed to step_percentile
that are not one of its arguments will then be passed to stats::quantile
. We recommend making a seperate list object with the options and use these inside the function.
Next, you can utilize the internal function step
that sets the class of new objects. Using subclass = "percentile"
will set the class of new objects to `“step_percentile”.
step_percentile_new <- function(terms = NULL, role = NA, trained = FALSE,
ref_dist = NULL, approx = NULL, options = NULL) {
step(
subclass = "percentile",
terms = terms,
role = role,
trained = trained,
ref_dist = ref_dist,
approx = approx,
options = options
)
}
You will need to create a new prep
method for your step’s class. To do this, three arguments that the method should have:
function(x, training, info = NULL)
where
x
will be the step_percentile
objecttraining
will be a tibble that has the training set datainfo
will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific prep
method so it may not have the variables from the original data. The columns in this tibble are variable
(the variable name), type
(currently either “numeric” or “nominal”), role
(defining the variable’s role), and source
(either “original” or “derived” depending on where it originated).You can define other options.
The first thing that you might want to do in the prep
function is to translate the specification listed in the terms
argument to column names in the current data. There is an internal function called terms_select
that can be used to obtain this.
prep.step_percentile <- function(x, training, info = NULL, ...) {
col_names <- terms_select(terms = x$terms, info = info)
}
Once we have this, we can either save the original data columns or estimate the approximation grid. For the grid, we will use a helper function that enables us to run do.call
on a list of arguments that include the options
list.
get_pctl <- function(x, args) {
args$x <- x
do.call("quantile", args)
}
prep.step_percentile <- function(x, training, info = NULL, ...) {
col_names <- terms_select(terms = x$terms, info = info)
## You can add error trapping for non-numeric data here and so on.
## We'll use the names later so
if(x$options$names == FALSE)
stop("`names` should be set to TRUE")
if(!x$approx) {
x$ref_dist <- training[, col_names]
} else {
pctl <- lapply(
training[, col_names],
get_pctl,
args = x$options
)
x$ref_dist <- pctl
}
## Always return the updated step
x
}
bake
methodRemember that the prep
function does not apply the step to the data; it only estimates any required values such as ref_dist
. We will need to create a new method for our step_percentile
class. The minimum arguments for this are
function(object, newdata, ...)
where object
is the updated step function that has been through the corresponding prep
code and newdata
is a tibble of data to be preprocessingcessed.
Here is the code to convert the new data to percentiles. Two initial helper functions handle the two cases (approximation or not). We always return a tibble as the output.
## Two helper functions
pctl_by_mean <- function(x, ref) mean(ref <= x)
pctl_by_approx <- function(x, ref) {
## go from 1 column tibble to vector
x <- getElement(x, names(x))
## get the percentiles values from the names (e.g. "10%")
p_grid <- as.numeric(gsub("%$", "", names(ref)))
approx(x = ref, y = p_grid, xout = x)$y/100
}
bake.step_percentile <- function(object, newdata, ...) {
require(tibble)
## For illustration (and not speed), we will loop through the affected variables
## and do the computations
vars <- names(object$ref_dist)
for(i in vars) {
if(!object$approx) {
## We can use `apply` since tibbles do not drop dimensions:
newdata[, i] <- apply(newdata[, i], 1, pctl_by_mean,
ref = object$ref_dist[, i])
} else
newdata[, i] <- pctl_by_approx(newdata[, i], object$ref_dist[[i]])
}
## Always convert to tibbles on the way out
as_tibble(newdata)
}
Let’s use the example data to make sure that it works:
rec_obj <- recipe(HHV ~ ., data = biomass_tr[, -(1:2)])
rec_obj <- rec_obj %>%
step_percentile(all_predictors(), approx = TRUE)
rec_obj <- prep(rec_obj, training = biomass_tr)
#> step 1 percentile training
percentiles <- bake(rec_obj, biomass_te)
percentiles
#> # A tibble: 80 x 5
#> carbon hydrogen oxygen nitrogen sulfur
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.4209 0.4500 0.9026 0.215 0.735
#> 2 0.1800 0.3850 0.9217 0.928 0.839
#> 3 0.1561 0.3850 0.9447 0.900 0.805
#> 4 0.4233 0.7750 0.2800 0.845 0.902
#> 5 0.6662 0.8667 0.6314 0.155 0.090
#> 6 0.2175 0.3850 0.5363 0.495 0.700
#> 7 0.0803 0.2713 0.9859 0.695 0.903
#> 8 0.1395 0.1260 0.1604 0.606 0.700
#> 9 0.0226 0.1035 0.1312 0.126 0.996
#> 10 0.0178 0.0821 0.0987 0.972 0.974
#> # ... with 70 more rows
The plot below shows how the original data line up with the percentiles for each split of the data for one of the predictors:
When recipe steps are used, there are different approaches that can be used to select which variables or features should be used.
The three main characteristics of variables that can be queried:
The manual pages for ?selections
and ?has_role
have details about the available selection methods.
To illustrate this, the credit data will be used:
library(recipes)
data("credit_data")
str(credit_data)
#> 'data.frame': 4454 obs. of 14 variables:
#> $ Status : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
#> $ Seniority: int 9 17 10 0 0 1 29 9 0 0 ...
#> $ Home : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
#> $ Time : int 60 60 36 60 36 60 60 12 60 48 ...
#> $ Age : int 30 58 46 24 26 36 44 27 32 41 ...
#> $ Marital : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
#> $ Records : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
#> $ Job : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
#> $ Expenses : int 73 48 90 63 46 75 75 35 90 90 ...
#> $ Income : int 129 131 200 182 107 214 125 80 107 80 ...
#> $ Assets : int 0 0 3000 2500 0 3500 10000 0 15000 0 ...
#> $ Debt : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Amount : int 800 1000 2000 900 310 650 1600 200 1200 1200 ...
#> $ Price : int 846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)
rec
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 4
Before any steps are used the information on the original variables is:
summary(rec, original = TRUE)
#> # A tibble: 5 x 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Seniority numeric predictor original
#> 2 Time numeric predictor original
#> 3 Age numeric predictor original
#> 4 Records nominal predictor original
#> 5 Status nominal outcome original
We can add a step to compute dummy variables on the non-numeric data after we impute any missing data:
dummied <- rec %>% step_dummy(all_nominal())
This will capture any variables that are either character strings or factors: Status
and Records
. However, since Status
is our outcome, we might want to keep it as a factor so we can subtract that variable out either by name or by role:
dummied <- rec %>% step_dummy(Records) # or
dummied <- rec %>% step_dummy(all_nominal(), - Status) # or
dummied <- rec %>% step_dummy(all_nominal(), - all_outcomes())
Using the last definition:
dummied <- prep(dummied, training = credit_data)
#> step 1 dummy training
with_dummy <- bake(dummied, newdata = credit_data)
with_dummy
#> # A tibble: 4,454 x 4
#> Seniority Time Age Records_yes
#> <int> <int> <int> <dbl>
#> 1 9 60 30 0
#> 2 17 60 58 0
#> 3 10 36 46 1
#> 4 0 60 24 0
#> 5 0 36 26 0
#> 6 1 60 36 0
#> 7 29 60 44 0
#> 8 9 12 27 0
#> 9 0 60 32 0
#> 10 0 48 41 0
#> # ... with 4,444 more rows
Status
is unaffected.
One important aspect about selecting variables in steps is that the variable names and types may change as steps are being executed. In the above example, Records
is a factor variable before the step is executed. Afterwards, Records
is gone and the binary variable Records_yes
is in its place. One reason to have general selection routines like all_predictors
or contains
is to be able to select variables that have not be created yet.
This document demonstrates some basic uses of recipes. First, some definitions are required:
Y ~ A + B + A:B
, the variables are A
, B
, and Y
.predictor
(independent variables), response
, and case weight
. This is meant to be open-ended and extensible.A
, B
, and A:B
. These can be other derived entities that are grouped such a a set of principal components or a set of columns that define a basis function for a variable. These are synonymous with features in machine learning. Variables that have predictor
roles would automatically be main effect termsThe cell segmentation data will be used. It has 58 predictor columns, a factor variable Class
(the outcome), and two extra labelling columns. Each of the predictors has a suffix for the optical channel ("Ch1"
-"Ch4"
). We will first separate the data into a training and test set then remove unimportant variables:
library(recipes)
library(caret)
data(segmentationData)
seg_train <- segmentationData %>%
filter(Case == "Train") %>%
select(-Case, -Cell)
seg_test <- segmentationData %>%
filter(Case == "Test") %>%
select(-Case, -Cell)
The idea is that the preprocessing operations will all be created using the training set and then these steps will be applied to both the training and test set.
For a first recipe, let’s plan on centering and scaling the predictors. First, we will create a recipe from the original data and then specify the processing steps.
Recipes can be created manually by sequentially adding roles to variables in a data set.
If the analysis only required outcomes and predictors, the easiest way to create the initial recipe is to use the standard formula method:
rec_obj <- recipe(Class ~ ., data = seg_train)
rec_obj
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 58
The data contained in the data
argument need not be the training set; this data is only used to catalog the names of the variables and their types (e.g. numeric, etc.).
(Note that the formula method here is used to declare the variables and their roles and nothing else. If you use inline functions (e.g. log
) it will complain. These types of operations can be added later.)
From here, preprocessing steps can be added sequentially in one of two ways:
rec_obj <- step_name(rec_obj, arguments) ## or
rec_obj <- rec_obj %>% step_name(arguments)
step_center
and the other functions will always return updated recipes.
One other important facet of the code is the method for specifying which variables should be used in different steps. The manual page ?selections
has more details but dplyr
-like selector functions can be used:
x1, x2
),dplyr
functions for selecting variables: contains
, ends_with
, everything
, matches
, num_range
, and starts_with
,all_outcomes
, all_predictors
, has_role
, orall_nominal
, all_numeric
, and has_type
.Note that the functions listed above are the only ones that can be used to selecto variables inside the steps. Also, minus signs can be used to deselect variables.
For our data, we can add the two operations for all of the predictors:
standardized <- rec_obj %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
standardized
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 58
#>
#> Steps:
#>
#> Centering for all_predictors()
#> Scaling for all_predictors()
It is important to realize that the specific variables have not been declared yet (in this example). In some preprocessing steps, variables will be added or removed from the current list of possible variables.
If this is the only preprocessing steps for the predictors, we can now estimate the means and standard deviations from the training set. The prep
function is used with a recipe and a data set:
trained_rec <- prep(standardized, training = seg_train)
#> step 1 center training
#> step 2 scale training
Now that the statistics have been estimated, the preprocessing can be applied to the training and test set:
train_data <- bake(trained_rec, newdata = seg_train)
test_data <- bake(trained_rec, newdata = seg_test)
bake
returns a tibble:
class(test_data)
#> [1] "tbl_df" "tbl" "data.frame"
test_data
#> # A tibble: 1,010 x 58
#> AngleCh1 AreaCh1 AvgIntenCh1 AvgIntenCh2 AvgIntenCh3 AvgIntenCh4
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.0656 -0.647 -0.684 -1.177 -0.926 -0.9238
#> 2 -1.8040 -0.185 -0.632 -0.479 -0.809 -0.6666
#> 3 -1.0300 -0.707 1.207 3.035 0.348 1.3864
#> 4 1.6935 -0.684 0.806 2.664 0.296 0.8934
#> 5 1.8129 -0.342 -0.668 -1.172 -0.843 -0.9282
#> 6 -1.4759 0.784 -0.682 -0.628 -0.881 -0.5939
#> 7 1.2702 0.272 -0.672 -0.625 -0.809 -0.5156
#> 8 -1.5837 0.457 0.283 1.320 -0.613 -0.0891
#> 9 -0.7957 -0.412 -0.669 -1.168 -0.845 -0.9258
#> 10 0.0363 -0.638 -0.535 0.182 -0.555 -0.0253
#> # ... with 1,000 more rows, and 52 more variables:
#> # ConvexHullAreaRatioCh1 <dbl>, ConvexHullPerimRatioCh1 <dbl>,
#> # DiffIntenDensityCh1 <dbl>, DiffIntenDensityCh3 <dbl>,
#> # DiffIntenDensityCh4 <dbl>, EntropyIntenCh1 <dbl>,
#> # EntropyIntenCh3 <dbl>, EntropyIntenCh4 <dbl>, EqCircDiamCh1 <dbl>,
#> # EqEllipseLWRCh1 <dbl>, EqEllipseOblateVolCh1 <dbl>,
#> # EqEllipseProlateVolCh1 <dbl>, EqSphereAreaCh1 <dbl>,
#> # EqSphereVolCh1 <dbl>, FiberAlign2Ch3 <dbl>, FiberAlign2Ch4 <dbl>,
#> # FiberLengthCh1 <dbl>, FiberWidthCh1 <dbl>, IntenCoocASMCh3 <dbl>,
#> # IntenCoocASMCh4 <dbl>, IntenCoocContrastCh3 <dbl>,
#> # IntenCoocContrastCh4 <dbl>, IntenCoocEntropyCh3 <dbl>,
#> # IntenCoocEntropyCh4 <dbl>, IntenCoocMaxCh3 <dbl>,
#> # IntenCoocMaxCh4 <dbl>, KurtIntenCh1 <dbl>, KurtIntenCh3 <dbl>,
#> # KurtIntenCh4 <dbl>, LengthCh1 <dbl>, NeighborAvgDistCh1 <dbl>,
#> # NeighborMinDistCh1 <dbl>, NeighborVarDistCh1 <dbl>, PerimCh1 <dbl>,
#> # ShapeBFRCh1 <dbl>, ShapeLWRCh1 <dbl>, ShapeP2ACh1 <dbl>,
#> # SkewIntenCh1 <dbl>, SkewIntenCh3 <dbl>, SkewIntenCh4 <dbl>,
#> # SpotFiberCountCh3 <dbl>, SpotFiberCountCh4 <dbl>, TotalIntenCh1 <dbl>,
#> # TotalIntenCh2 <dbl>, TotalIntenCh3 <dbl>, TotalIntenCh4 <dbl>,
#> # VarIntenCh1 <dbl>, VarIntenCh3 <dbl>, VarIntenCh4 <dbl>,
#> # WidthCh1 <dbl>, XCentroid <dbl>, YCentroid <dbl>
After exploring the data, more preprocessing might be required. Steps can be added to the trained recipe. Suppose that we need to create PCA components but only from the predictors from channel 1 and any predictors that are areas:
trained_rec <- trained_rec %>%
step_pca(ends_with("Ch1"), contains("area"), num = 5)
trained_rec
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 58
#>
#> Training data contained 1009 data points and no missing data.
#>
#> Steps:
#>
#> Centering for AngleCh1, AreaCh1, ... [trained]
#> Scaling for AngleCh1, AreaCh1, ... [trained]
#> PCA extraction with ends_with("Ch1"), contains("area")
Note that only the last step has been estimated; the first two were previously trained and these activities are not duplicated. We can add the PCA estimates using prep
again:
trained_rec <- prep(trained_rec, training = seg_train)
#> step 1 center [pre-trained]
#> step 2 scale [pre-trained]
#> step 3 pca training
bake
can be reapplied to get the principal components in addition to the other variables:
test_data <- bake(trained_rec, newdata = seg_test)
names(test_data)
#> [1] "AvgIntenCh2" "AvgIntenCh3" "AvgIntenCh4"
#> [4] "DiffIntenDensityCh3" "DiffIntenDensityCh4" "EntropyIntenCh3"
#> [7] "EntropyIntenCh4" "FiberAlign2Ch3" "FiberAlign2Ch4"
#> [10] "IntenCoocASMCh3" "IntenCoocASMCh4" "IntenCoocContrastCh3"
#> [13] "IntenCoocContrastCh4" "IntenCoocEntropyCh3" "IntenCoocEntropyCh4"
#> [16] "IntenCoocMaxCh3" "IntenCoocMaxCh4" "KurtIntenCh3"
#> [19] "KurtIntenCh4" "SkewIntenCh3" "SkewIntenCh4"
#> [22] "SpotFiberCountCh3" "SpotFiberCountCh4" "TotalIntenCh2"
#> [25] "TotalIntenCh3" "TotalIntenCh4" "VarIntenCh3"
#> [28] "VarIntenCh4" "XCentroid" "YCentroid"
#> [31] "PC1" "PC2" "PC3"
#> [34] "PC4" "PC5"
Note that the PCA components have replaced the original variables that were from channel 1 or measured an area aspect of the cells.
There are a number of different steps included in the package:
steps <- apropos("^step_")
steps[!grepl("new$", steps)]
#> [1] "step_BoxCox" "step_YeoJohnson" "step_bagimpute"
#> [4] "step_bin2factor" "step_center" "step_classdist"
#> [7] "step_corr" "step_date" "step_depth"
#> [10] "step_discretize" "step_dummy" "step_holiday"
#> [13] "step_hyperbolic" "step_ica" "step_interact"
#> [16] "step_intercept" "step_invlogit" "step_isomap"
#> [19] "step_knnimpute" "step_kpca" "step_lincomb"
#> [22] "step_log" "step_logit" "step_meanimpute"
#> [25] "step_modeimpute" "step_ns" "step_nzv"
#> [28] "step_ordinalscore" "step_other" "step_pca"
#> [31] "step_percentile" "step_poly" "step_range"
#> [34] "step_ratio" "step_regex" "step_rm"
#> [37] "step_scale" "step_shuffle" "step_spatialsign"
#> [40] "step_sqrt" "step_window"
In recipes, there are no constraints related to the order in which steps are added to the recipe. However, there are some general suggestions that you should consider:
step_dummy
first so that numeric columns are in the data set instead of factors.step_interact
, you should make dummy variables before creating the interactions.step_other
, call step_other
before step_dummy
.While your project’s needs may vary, here is a suggested order of potential steps that should work for most problems:
Again, your milage may vary for your particular problem.
recipes/inst/doc/Custom_Steps.Rmd 0000644 0001777 0001777 00000023266 13136242227 020067 0 ustar herbrandt herbrandt --- title: "Creating Custom Step Functions" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Custom Steps} %\VignetteEncoding{UTF-8} output: knitr:::html_vignette: toc: yes --- ```{r ex_setup, include=FALSE} knitr::opts_chunk$set( message = FALSE, digits = 3, collapse = TRUE, comment = "#>" ) options(digits = 3) ``` `recipes` contains a number of different steps included in the package: ```{r step_list} library(recipes) steps <- apropos("^step_") steps[!grepl("new$", steps)] ``` You might want to make your own and this page describes how to do that. If you are looking for good examples of existing steps, I would suggest looking at the code for [centering](https://github.com/topepo/recipes/blob/master/R/center.R) or [PCA](https://github.com/topepo/recipes/blob/master/R/pca.R) to start. # A new step definition At an example, let's create a step that replaces the value of a variable with its percentile from the training set. The date that I'll use is from the `recipes` package: ```{r initial} data(biomass) str(biomass) biomass_tr <- biomass[biomass$dataset == "Training",] biomass_te <- biomass[biomass$dataset == "Testing",] ``` To illustrate the transformation with the `carbon` variable, the training set distribution of that variables is shown below with a vertical line for the first value of the test set. ```{r carbon_dist} library(ggplot2) theme_set(theme_bw()) ggplot(biomass_tr, aes(x = carbon)) + geom_histogram(binwidth = 5, col = "blue", fill = "blue", alpha = .5) + geom_vline(xintercept = biomass_te$carbon[1], lty = 2) ``` Based on the training set, `r round(mean(biomass_tr$carbon <= biomass_te$carbon[1])*100, 1)`% of the data are less than a value of `r biomass_te$carbon[1]`. There are some applications where it might be advantageous to represent the predictor values are percentiles rather than their original values. Our new step will do this computation for any numeric variables of interest. We will call this `step_percentile`. The code below is designed for illustration and not speed or best practices. I've left out a lot of error trapping that we would want in a real implementation. # Create the initial function. The user-exposed function `step_percentile` is just a simple wrapper around an internal function called `add_step`. This function takes the same arguments as your function and simply adds it to a new recipe. The `...` signfies the variable selectors that can be used. ```{r initial_def} step_percentile <- function(recipe, ..., role = NA, trained = FALSE, ref_dist = NULL, approx = FALSE, options = list(probs = (0:100)/100, names = TRUE)) { ## bake but do not evaluate the variable selectors with ## the `quos` function in `rlang` terms <- rlang::quos(...) if(length(terms) == 0) stop("Please supply at least one variable specification. See ?selections.") add_step( recipe, step_percentile_new( terms = terms, trained = trained, role = role, ref_dist = ref_dist, approx = approx, options = options)) } ``` You should always keep the first four arguments (`recipe` though `trained`) the same as listed above. Some notes: * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. I've added extra arguments specific to this step. In order to calculate the percentile, the training data for the relevant columns will need to be saved. This data will be saved in the `ref_dist` object. However, this might be problematic if the data set is large. `approx` would be used when you want to save a grid of pre-computed percentiles from the training set and use these to estimate the percentile for a new data point. If `approx = TRUE`, the argument `ref_dist` will contain the grid for each variable. We will use the `stats::quantile` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculations is done. We could just use the ellipses (aka `...`) so that any options passed to `step_percentile` that are not one of its arguments will then be passed to `stats::quantile`. We recommend making a seperate list object with the options and use these inside the function. # Initialization of new objects Next, you can utilize the internal function `step` that sets the class of new objects. Using `subclass = "percentile"` will set the class of new objects to `"step_percentile". ```{r initialize} step_percentile_new <- function(terms = NULL, role = NA, trained = FALSE, ref_dist = NULL, approx = NULL, options = NULL) { step( subclass = "percentile", terms = terms, role = role, trained = trained, ref_dist = ref_dist, approx = approx, options = options ) } ``` # Define the estimation procedure You will need to create a new `prep` method for your step's class. To do this, three arguments that the method should have: ```r function(x, training, info = NULL) ``` where * `x` will be the `step_percentile` object * `training` will be a _tibble_ that has the training set data * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either "numeric" or "nominal"), `role` (defining the variable's role), and `source` (either "original" or "derived" depending on where it originated). You can define other options. The first thing that you might want to do in the `prep` function is to translate the specification listed in the `terms` argument to column names in the current data. There is an internal function called `terms_select` that can be used to obtain this. ```{r prep_1, eval = FALSE} prep.step_percentile <- function(x, training, info = NULL, ...) { col_names <- terms_select(terms = x$terms, info = info) } ``` Once we have this, we can either save the original data columns or estimate the approximation grid. For the grid, we will use a helper function that enables us to run `do.call` on a list of arguments that include the `options` list. ```{r prep_2} get_pctl <- function(x, args) { args$x <- x do.call("quantile", args) } prep.step_percentile <- function(x, training, info = NULL, ...) { col_names <- terms_select(terms = x$terms, info = info) ## You can add error trapping for non-numeric data here and so on. ## We'll use the names later so if(x$options$names == FALSE) stop("`names` should be set to TRUE") if(!x$approx) { x$ref_dist <- training[, col_names] } else { pctl <- lapply( training[, col_names], get_pctl, args = x$options ) x$ref_dist <- pctl } ## Always return the updated step x } ``` # Create the `bake` method Remember that the `prep` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentile` class. The minimum arguments for this are ```r function(object, newdata, ...) ``` where `object` is the updated step function that has been through the corresponding `prep` code and `newdata` is a tibble of data to be preprocessingcessed. Here is the code to convert the new data to percentiles. Two initial helper functions handle the two cases (approximation or not). We always return a tibble as the output. ```{r bake} ## Two helper functions pctl_by_mean <- function(x, ref) mean(ref <= x) pctl_by_approx <- function(x, ref) { ## go from 1 column tibble to vector x <- getElement(x, names(x)) ## get the percentiles values from the names (e.g. "10%") p_grid <- as.numeric(gsub("%$", "", names(ref))) approx(x = ref, y = p_grid, xout = x)$y/100 } bake.step_percentile <- function(object, newdata, ...) { require(tibble) ## For illustration (and not speed), we will loop through the affected variables ## and do the computations vars <- names(object$ref_dist) for(i in vars) { if(!object$approx) { ## We can use `apply` since tibbles do not drop dimensions: newdata[, i] <- apply(newdata[, i], 1, pctl_by_mean, ref = object$ref_dist[, i]) } else newdata[, i] <- pctl_by_approx(newdata[, i], object$ref_dist[[i]]) } ## Always convert to tibbles on the way out as_tibble(newdata) } ``` # Running the example Let's use the example data to make sure that it works: ```{r example} rec_obj <- recipe(HHV ~ ., data = biomass_tr[, -(1:2)]) rec_obj <- rec_obj %>% step_percentile(all_predictors(), approx = TRUE) rec_obj <- prep(rec_obj, training = biomass_tr) percentiles <- bake(rec_obj, biomass_te) percentiles ``` The plot below shows how the original data line up with the percentiles for each split of the data for one of the predictors: ```{r cdf_plot, echo = FALSE} grid_pct <- rec_obj$steps[[1]]$options$probs plot_data <- data.frame( carbon = c( quantile(biomass_tr$carbon, probs = grid_pct), biomass_te$carbon ), percentile = c(grid_pct, percentiles$carbon), dataset = rep( c("Training", "Testing"), c(length(grid_pct), nrow(percentiles)) ) ) ggplot(plot_data, aes(x = carbon, y = percentile, col = dataset)) + geom_point(alpha = .4, cex = 2) + theme(legend.position = "top") ``` recipes/tests/ 0000755 0001777 0001777 00000000000 13136242227 014402 5 ustar herbrandt herbrandt recipes/tests/testthat.R 0000644 0001777 0001777 00000000115 13064546045 016367 0 ustar herbrandt herbrandt library(testthat) library(recipes) test_check(package = "recipes") q("no") recipes/tests/testthat/ 0000755 0001777 0001777 00000000000 13136342173 016243 5 ustar herbrandt herbrandt recipes/tests/testthat/test_spatialsign.R 0000644 0001777 0001777 00000001647 13135741217 021754 0 ustar herbrandt herbrandt library(testthat) library(recipes) data("biomass") rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass) test_that('spatial sign', { sp_sign <- rec %>% step_center(carbon, hydrogen) %>% step_scale(carbon, hydrogen) %>% step_spatialsign(carbon, hydrogen) sp_sign_trained <- prep(sp_sign, training = biomass, verbose = FALSE) sp_sign_pred <- bake(sp_sign_trained, newdata = biomass) sp_sign_pred <- as.matrix(sp_sign_pred)[, c("carbon", "hydrogen")] x <- as.matrix(scale(biomass[, 3:4], center = TRUE, scale = TRUE)) x <- t(apply(x, 1, function(x) x/sqrt(sum(x^2)))) expect_equal(sp_sign_pred, x) }) test_that('printing', { sp_sign <- rec %>% step_center(carbon, hydrogen) %>% step_scale(carbon, hydrogen) %>% step_spatialsign(carbon, hydrogen) expect_output(print(sp_sign)) expect_output(prep(sp_sign, training = biomass)) }) recipes/tests/testthat/test_rm.R 0000644 0001777 0001777 00000001311 13135741217 020040 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) n <- 20 set.seed(12) ex_dat <- data.frame(x1 = rnorm(n), x2 = runif(n)) test_that('simple logit trans', { rec <- recipe(~., data = ex_dat) %>% step_rm(x1) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_rm <- bake(rec_trained, newdata = ex_dat) expect_equal(colnames(rec_rm), "x2") }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_rm(x1) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_rm(x1) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_YeoJohnson.R 0000644 0001777 0001777 00000006117 13135741217 021526 0 ustar herbrandt herbrandt library(testthat) library(recipes) n <- 20 set.seed(1) ex_dat <- data.frame(x1 = exp(rnorm(n, mean = .1)), x2 = 1/rnorm(n), x3 = rep(1:2, each = n/2), x4 = rexp(n)) ## from `car` package exp_lambda <- c(x1 = -0.2727204451, x2 = 1.139292543, x3 = NA, x4 = -1.012702061) exp_dat <- structure(list(x1 = c(0.435993557749438, 0.754696454247318, 0.371327932207827, 1.46113017436327, 0.82204097731098, 0.375761562702297, 0.89751975937422, 1.02175936118846, 0.940739811377902, 0.54984302797741, 1.41856737837093, 0.850587387615876, 0.437701618670981, 0.112174615510591, 1.21942112715274, 0.654589551748501, 0.666780580127795, 1.12625135443351, 1.0636850911955, 0.949680956411546), x2 = c(1.15307873387121, 1.36532999080347, 17.4648439780388, -0.487746797875704, 1.74452440065935, -13.3640721541574, -5.35805967319061, -0.653901985285932, -1.90735599477338, 2.65253432454371, 0.76771137336975, -7.79484535687973, 2.87484976680907, -13.8738947581599, -0.696856395842167, -2.17745353101028, -2.28384276604207, -12.7261652971783, 0.95585544349634, 1.40099012093008), x3 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), x4 = c(0.49061104973894, 0.49670370366879, 0.338742419511653, 0.663722100577351, 0.296260662322359, 0.681346128666408, 0.757581280603711, 0.357148961119583, 0.371872889850153, 0.49239057672598, 0.173259524331095, 0.235933290139909, 0.52297977893566, 0.434927187456966, 0.0822501770191215, 0.523479652016858, 0.197977570919824, 0.608108816144845, 0.821913792446345, 0.300608495427594)), .Names = c("x1", "x2", "x3", "x4"), row.names = c(NA, -20L), class = "data.frame") test_that('simple YJ trans', { rec <- recipe(~., data = ex_dat) %>% step_YeoJohnson(x1, x2, x3, x4) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) expect_equal(names(exp_lambda)[!is.na(exp_lambda)], names(rec_trained$steps[[1]]$lambdas)) expect_equal(exp_lambda[!is.na(exp_lambda)], rec_trained$steps[[1]]$lambdas, tol = .001) expect_equal(as.matrix(exp_dat), as.matrix(rec_trans), tol = .05) }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_YeoJohnson(x1, x2, x3, x4) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_ordinalscore.R 0000644 0001777 0001777 00000004400 13135741217 022110 0 ustar herbrandt herbrandt library(testthat) library(recipes) n <- 20 set.seed(752) ex_dat <- data.frame( numbers = rnorm(n), fact = factor(sample(letters[1:3], n, replace = TRUE)), ord1 = factor(sample(LETTERS[1:3], n, replace = TRUE), ordered = TRUE), ord2 = factor(sample(LETTERS[4:8], n, replace = TRUE), ordered = TRUE), ord3 = factor(sample(LETTERS[10:20], n, replace = TRUE), ordered = TRUE) ) ex_miss <- ex_dat ex_miss$ord1[c(1, 5, 9)] <- NA ex_miss$ord3[2] <- NA score <- function(x) as.numeric(x)^2 test_that('linear scores', { rec1 <- recipe(~ ., data = ex_dat) %>% step_ordinalscore(starts_with("ord")) rec1 <- prep(rec1, training = ex_dat, retain = TRUE, stringsAsFactors = FALSE, verbose = FALSE) rec1_scores <- bake(rec1, newdata = ex_dat) rec1_scores_NA <- bake(rec1, newdata = ex_miss) expect_equal(as.numeric(ex_dat$ord1), rec1_scores$ord1) expect_equal(as.numeric(ex_dat$ord2), rec1_scores$ord2) expect_equal(as.numeric(ex_dat$ord3), rec1_scores$ord3) expect_equal(as.numeric(ex_miss$ord1), rec1_scores_NA$ord1) expect_equal(as.numeric(ex_miss$ord3), rec1_scores_NA$ord3) }) test_that('nonlinear scores', { rec2 <- recipe(~ ., data = ex_dat) %>% step_ordinalscore(starts_with("ord"), convert = score) rec2 <- prep(rec2, training = ex_dat, retain = TRUE, stringsAsFactors = FALSE, verbose = FALSE) rec2_scores <- bake(rec2, newdata = ex_dat) rec2_scores_NA <- bake(rec2, newdata = ex_miss) expect_equal(as.numeric(ex_dat$ord1)^2, rec2_scores$ord1) expect_equal(as.numeric(ex_dat$ord2)^2, rec2_scores$ord2) expect_equal(as.numeric(ex_dat$ord3)^2, rec2_scores$ord3) expect_equal(as.numeric(ex_miss$ord1)^2, rec2_scores_NA$ord1) expect_equal(as.numeric(ex_miss$ord3)^2, rec2_scores_NA$ord3) }) test_that('bad spec', { rec3 <- recipe(~ ., data = ex_dat) %>% step_ordinalscore(everything()) expect_error(prep(rec3, training = ex_dat, verbose = FALSE)) rec4 <- recipe(~ ., data = ex_dat) expect_error(rec4 %>% step_ordinalscore()) }) test_that('printing', { rec5 <- recipe(~ ., data = ex_dat) %>% step_ordinalscore(starts_with("ord")) expect_output(print(rec5)) expect_output(prep(rec5, training = ex_dat)) }) recipes/tests/testthat/test_other.R 0000644 0001777 0001777 00000011720 13135741217 020550 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(okc) set.seed(19) in_train <- sample(1:nrow(okc), size = 30000) okc_tr <- okc[ in_train,] okc_te <- okc[-in_train,] rec <- recipe(~ diet + location, data = okc_tr) test_that('default inputs', { others <- rec %>% step_other(diet, location) others <- prep(others, training = okc_tr) others_te <- bake(others, newdata = okc_te) diet_props <- table(okc_tr$diet)/sum(!is.na(okc_tr$diet)) diet_props <- sort(diet_props, decreasing = TRUE) diet_levels <- names(diet_props)[diet_props >= others$step[[1]]$threshold] for(i in diet_levels) expect_equal(sum(others_te$diet == i, na.rm =TRUE), sum(okc_te$diet == i, na.rm =TRUE)) diet_levels <- c(diet_levels, others$step[[1]]$objects[["diet"]]$other) expect_true(all(levels(others_te$diet) %in% diet_levels)) expect_true(all(diet_levels %in% levels(others_te$diet))) location_props <- table(okc_tr$location)/sum(!is.na(okc_tr$location)) location_props <- sort(location_props, decreasing = TRUE) location_levels <- names(location_props)[location_props >= others$step[[1]]$threshold] for(i in location_levels) expect_equal(sum(others_te$location == i, na.rm =TRUE), sum(okc_te$location == i, na.rm =TRUE)) location_levels <- c(location_levels, others$step[[1]]$objects[["location"]]$other) expect_true(all(levels(others_te$location) %in% location_levels)) expect_true(all(location_levels %in% levels(others_te$location))) expect_equal(is.na(okc_te$diet), is.na(others_te$diet)) expect_equal(is.na(okc_te$location), is.na(others_te$location)) }) test_that('high threshold - much removals', { others <- rec %>% step_other(diet, location, threshold = .5) others <- prep(others, training = okc_tr) others_te <- bake(others, newdata = okc_te) diet_props <- table(okc_tr$diet) diet_levels <- others$steps[[1]]$objects$diet$keep for(i in diet_levels) expect_equal(sum(others_te$diet == i, na.rm =TRUE), sum(okc_te$diet == i, na.rm =TRUE)) diet_levels <- c(diet_levels, others$step[[1]]$objects[["diet"]]$other) expect_true(all(levels(others_te$diet) %in% diet_levels)) expect_true(all(diet_levels %in% levels(others_te$diet))) location_props <- table(okc_tr$location) location_levels <- others$steps[[1]]$objects$location$keep for(i in location_levels) expect_equal(sum(others_te$location == i, na.rm =TRUE), sum(okc_te$location == i, na.rm =TRUE)) location_levels <- c(location_levels, others$step[[1]]$objects[["location"]]$other) expect_true(all(levels(others_te$location) %in% location_levels)) expect_true(all(location_levels %in% levels(others_te$location))) expect_equal(is.na(okc_te$diet), is.na(others_te$diet)) expect_equal(is.na(okc_te$location), is.na(others_te$location)) }) test_that('low threshold - no removals', { others <- rec %>% step_other(diet, location, threshold = 10^-10) others <- prep(others, training = okc_tr, stringsAsFactors = FALSE) others_te <- bake(others, newdata = okc_te) expect_equal(others$steps[[1]]$objects$diet$collapse, FALSE) expect_equal(others$steps[[1]]$objects$location$collapse, FALSE) expect_equal(okc_te$diet, others_te$diet) expect_equal(okc_te$location, others_te$location) }) test_that('factor inputs', { okc$diet <- as.factor(okc$diet) okc$location <- as.factor(okc$location) okc_tr <- okc[ in_train,] okc_te <- okc[-in_train,] rec <- recipe(~ diet + location, data = okc_tr) others <- rec %>% step_other(diet, location) others <- prep(others, training = okc_tr) others_te <- bake(others, newdata = okc_te) diet_props <- table(okc_tr$diet)/sum(!is.na(okc_tr$diet)) diet_props <- sort(diet_props, decreasing = TRUE) diet_levels <- names(diet_props)[diet_props >= others$step[[1]]$threshold] for(i in diet_levels) expect_equal(sum(others_te$diet == i, na.rm =TRUE), sum(okc_te$diet == i, na.rm =TRUE)) diet_levels <- c(diet_levels, others$step[[1]]$objects[["diet"]]$other) expect_true(all(levels(others_te$diet) %in% diet_levels)) expect_true(all(diet_levels %in% levels(others_te$diet))) location_props <- table(okc_tr$location)/sum(!is.na(okc_tr$location)) location_props <- sort(location_props, decreasing = TRUE) location_levels <- names(location_props)[location_props >= others$step[[1]]$threshold] for(i in location_levels) expect_equal(sum(others_te$location == i, na.rm =TRUE), sum(okc_te$location == i, na.rm =TRUE)) location_levels <- c(location_levels, others$step[[1]]$objects[["location"]]$other) expect_true(all(levels(others_te$location) %in% location_levels)) expect_true(all(location_levels %in% levels(others_te$location))) expect_equal(is.na(okc_te$diet), is.na(others_te$diet)) expect_equal(is.na(okc_te$location), is.na(others_te$location)) }) test_that('printing', { rec <- rec %>% step_other(diet, location) expect_output(print(rec)) expect_output(prep(rec, training = okc_tr)) }) recipes/tests/testthat/test_ns.R 0000644 0001777 0001777 00000004205 13135741217 020047 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(biomass) library(splines) biomass_tr <- biomass[biomass$dataset == "Training",] biomass_te <- biomass[biomass$dataset == "Testing",] rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass_tr) test_that('correct basis functions', { with_ns <- rec %>% step_ns(carbon, hydrogen) with_ns <- prep(with_ns, training = biomass_tr, verbose = FALSE) with_ns_pred_tr <- bake(with_ns, newdata = biomass_tr) with_ns_pred_te <- bake(with_ns, newdata = biomass_te) carbon_ns_tr_exp <- ns(biomass_tr$carbon, df = 2) hydrogen_ns_tr_exp <- ns(biomass_tr$hydrogen, df = 2) carbon_ns_te_exp <- predict(carbon_ns_tr_exp, biomass_te$carbon) hydrogen_ns_te_exp <- predict(hydrogen_ns_tr_exp, biomass_te$hydrogen) carbon_ns_tr_res <- as.matrix(with_ns_pred_tr[, grep("carbon", names(with_ns_pred_tr))]) colnames(carbon_ns_tr_res) <- NULL hydrogen_ns_tr_res <- as.matrix(with_ns_pred_tr[, grep("hydrogen", names(with_ns_pred_tr))]) colnames(hydrogen_ns_tr_res) <- NULL carbon_ns_te_res <- as.matrix(with_ns_pred_te[, grep("carbon", names(with_ns_pred_te))]) colnames(carbon_ns_te_res) <- 1:ncol(carbon_ns_te_res) hydrogen_ns_te_res <- as.matrix(with_ns_pred_te[, grep("hydrogen", names(with_ns_pred_te))]) colnames(hydrogen_ns_te_res) <- 1:ncol(hydrogen_ns_te_res) ## remove attributes carbon_ns_tr_exp <- matrix(carbon_ns_tr_exp, ncol = 2) carbon_ns_te_exp <- matrix(carbon_ns_te_exp, ncol = 2) hydrogen_ns_tr_exp <- matrix(hydrogen_ns_tr_exp, ncol = 2) hydrogen_ns_te_exp <- matrix(hydrogen_ns_te_exp, ncol = 2) dimnames(carbon_ns_tr_res) <- NULL dimnames(carbon_ns_te_res) <- NULL dimnames(hydrogen_ns_tr_res) <- NULL dimnames(hydrogen_ns_te_res) <- NULL expect_equal(carbon_ns_tr_res, carbon_ns_tr_exp) expect_equal(carbon_ns_te_res, carbon_ns_te_exp) expect_equal(hydrogen_ns_tr_res, hydrogen_ns_tr_exp) expect_equal(hydrogen_ns_te_res, hydrogen_ns_te_exp) }) test_that('printing', { with_ns <- rec %>% step_ns(carbon, hydrogen) expect_output(print(with_ns)) expect_output(prep(with_ns, training = biomass_tr)) }) recipes/tests/testthat/test_poly.R 0000644 0001777 0001777 00000004364 13135741217 020420 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(biomass) biomass_tr <- biomass[biomass$dataset == "Training",] biomass_te <- biomass[biomass$dataset == "Testing",] rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass_tr) test_that('correct basis functions', { with_poly <- rec %>% step_poly(carbon, hydrogen) with_poly <- prep(with_poly, training = biomass_tr, verbose = FALSE) with_poly_pred_tr <- bake(with_poly, newdata = biomass_tr) with_poly_pred_te <- bake(with_poly, newdata = biomass_te) carbon_poly_tr_exp <- poly(biomass_tr$carbon, degree = 2) hydrogen_poly_tr_exp <- poly(biomass_tr$hydrogen, degree = 2) carbon_poly_te_exp <- predict(carbon_poly_tr_exp, biomass_te$carbon) hydrogen_poly_te_exp <- predict(hydrogen_poly_tr_exp, biomass_te$hydrogen) carbon_poly_tr_res <- as.matrix(with_poly_pred_tr[, grep("carbon", names(with_poly_pred_tr))]) colnames(carbon_poly_tr_res) <- NULL hydrogen_poly_tr_res <- as.matrix(with_poly_pred_tr[, grep("hydrogen", names(with_poly_pred_tr))]) colnames(hydrogen_poly_tr_res) <- NULL carbon_poly_te_res <- as.matrix(with_poly_pred_te[, grep("carbon", names(with_poly_pred_te))]) colnames(carbon_poly_te_res) <- 1:ncol(carbon_poly_te_res) hydrogen_poly_te_res <- as.matrix(with_poly_pred_te[, grep("hydrogen", names(with_poly_pred_te))]) colnames(hydrogen_poly_te_res) <- 1:ncol(hydrogen_poly_te_res) ## remove attributes carbon_poly_tr_exp <- matrix(carbon_poly_tr_exp, ncol = 2) carbon_poly_te_exp <- matrix(carbon_poly_te_exp, ncol = 2) hydrogen_poly_tr_exp <- matrix(hydrogen_poly_tr_exp, ncol = 2) hydrogen_poly_te_exp <- matrix(hydrogen_poly_te_exp, ncol = 2) dimnames(carbon_poly_tr_res) <- NULL dimnames(carbon_poly_te_res) <- NULL dimnames(hydrogen_poly_tr_res) <- NULL dimnames(hydrogen_poly_te_res) <- NULL expect_equal(carbon_poly_tr_res, carbon_poly_tr_exp) expect_equal(carbon_poly_te_res, carbon_poly_te_exp) expect_equal(hydrogen_poly_tr_res, hydrogen_poly_tr_exp) expect_equal(hydrogen_poly_te_res, hydrogen_poly_te_exp) }) test_that('printing', { with_poly <- rec %>% step_poly(carbon, hydrogen) expect_output(print(with_poly)) expect_output(prep(with_poly, training = biomass_tr)) }) recipes/tests/testthat/test_interact.R 0000644 0001777 0001777 00000004525 13135741217 021245 0 ustar herbrandt herbrandt library(testthat) library(recipes) data("biomass") tr_biomass <- subset(biomass, dataset == "Training")[, -(1:2)] te_biomass <- subset(biomass, dataset == "Testing")[, -(1:2)] rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = tr_biomass) test_that('non-factor variables with dot', { int_rec <- rec %>% step_interact(~(.-HHV)^3, sep=":") int_rec_trained <- prep(int_rec, training = tr_biomass, verbose = FALSE) te_new <- bake(int_rec_trained, newdata = te_biomass, all_predictors()) te_new <- te_new[, sort(names(te_new))] te_new <- as.matrix(te_new) og_terms <- terms(~(.-HHV)^3, data = te_biomass) te_og <- model.matrix(og_terms, data = te_biomass)[, -1] te_og <- te_og[, sort(colnames(te_og))] rownames(te_new) <- NULL rownames(te_og) <- NULL expect_equal(te_og, te_new) }) test_that('non-factor variables with specific variables', { int_rec <- rec %>% step_interact(~carbon:hydrogen + oxygen:nitrogen:sulfur, sep = ":") int_rec_trained <- prep(int_rec, training = tr_biomass, verbose = FALSE) te_new <- bake(int_rec_trained, newdata = te_biomass, all_predictors()) te_new <- te_new[, sort(names(te_new))] te_new <- as.matrix(te_new) og_terms <- terms(~carbon + hydrogen + oxygen + nitrogen + sulfur + carbon:hydrogen + oxygen:nitrogen:sulfur, data = te_biomass) te_og <- model.matrix(og_terms, data = te_biomass)[, -1] te_og <- te_og[, sort(colnames(te_og))] rownames(te_new) <- NULL rownames(te_og) <- NULL expect_equal(te_og, te_new) }) test_that('printing', { int_rec <- rec %>% step_interact(~carbon:hydrogen) expect_output(print(int_rec)) expect_output(prep(int_rec, training = tr_biomass)) }) # currently failing; try to figure out why # test_that('with factors', { # int_rec <- recipe(Sepal.Width ~ ., data = iris) %>% # step_interact(~ (. - Sepal.Width)^3, sep = ":") # int_rec_trained <- prep(int_rec, iris) # # te_new <- bake(int_rec_trained, newdata = iris, role = "predictor") # te_new <- te_new[, sort(names(te_new))] # te_new <- as.matrix(te_new) # # og_terms <- terms(Sepal.Width ~ (.)^3, data = iris) # te_og <- model.matrix(og_terms, data = iris)[, -1] # te_og <- te_og[, sort(colnames(te_og))] # # rownames(te_new) <- NULL # rownames(te_og) <- NULL # # all.equal(te_og, te_new) # }) recipes/tests/testthat/test_lincomb.R 0000644 0001777 0001777 00000004142 13135741217 021052 0 ustar herbrandt herbrandt library(testthat) library(recipes) dummies <- cbind(model.matrix( ~ block - 1, npk), model.matrix( ~ N - 1, npk), model.matrix( ~ P - 1, npk), model.matrix( ~ K - 1, npk), yield = npk$yield) dummies <- as.data.frame(dummies) dum_rec <- recipe(yield ~ . , data = dummies) ################################################################### data(biomass) biomass$new_1 <- with(biomass, .1*carbon - .2*hydrogen + .6*sulfur) biomass$new_2 <- with(biomass, .5*carbon - .2*oxygen + .6*nitrogen) biomass_tr <- biomass[biomass$dataset == "Training",] biomass_te <- biomass[biomass$dataset == "Testing",] biomass_rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + new_1 + new_2, data = biomass_tr) ################################################################### test_that('example 1', { dum_filtered <- dum_rec %>% step_lincomb(all_predictors()) dum_filtered <- prep(dum_filtered, training = dummies, verbose = FALSE) removed <- c("N1", "P1", "K1") expect_equal(dum_filtered$steps[[1]]$removals, removed) }) test_that('example 2', { lincomb_filter <- biomass_rec %>% step_lincomb(all_predictors()) filtering_trained <- prep(lincomb_filter, training = biomass_tr) test_res <- bake(filtering_trained, newdata = biomass_te) expect_true(all(!(paste0("new_", 1:2) %in% colnames(test_res)))) }) test_that('no exclusions', { biomass_rec_2 <- recipe(HHV ~ carbon + hydrogen, data = biomass_tr) lincomb_filter_2 <- biomass_rec_2 %>% step_lincomb(all_predictors()) filtering_trained_2 <- prep(lincomb_filter_2, training = biomass_tr) test_res_2 <- bake(filtering_trained_2, newdata = biomass_te) expect_true(length(filtering_trained_2$steps[[1]]$removals) == 0) expect_true(all(colnames(test_res_2) == c("carbon", "hydrogen"))) }) test_that('printing', { dum_filtered <- dum_rec %>% step_lincomb(all_predictors()) expect_output(print(dum_filtered)) expect_output(prep(dum_filtered, training = dummies)) }) recipes/tests/testthat/test_depth.R 0000644 0001777 0001777 00000003176 13135741217 020541 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(ddalpha) test_that("defaults", { rec <- recipe(Species ~ ., data = iris) %>% step_depth(all_predictors(), class = "Species", metric = "spatial") trained <- prep(rec, training = iris, verbose = FALSE) depths <- bake(trained, newdata = iris) depths <- depths[, grepl("depth", names(depths))] depths <- as.data.frame(depths) split_up <- split(iris[, 1:4], iris$Species) spatial <- function(x, y) depth.spatial(x = y, data = x) exp_res <- lapply(split_up, spatial, y = iris[, 1:4]) exp_res <- as.data.frame(exp_res) for(i in 1:ncol(exp_res)) expect_equal(depths[, i], exp_res[, i]) }) test_that("alt args", { rec <- recipe(Species ~ ., data = iris) %>% step_depth(all_predictors(), class = "Species", metric = "Mahalanobis", options = list(mah.estimate = "MCD", mah.parMcd = .75)) trained <- prep(rec, training = iris, verbose = FALSE) depths <- bake(trained, newdata = iris) depths <- depths[, grepl("depth", names(depths))] depths <- as.data.frame(depths) split_up <- split(iris[, 1:4], iris$Species) Mahalanobis <- function(x, y) depth.Mahalanobis(x = y, data = x, mah.estimate = "MCD", mah.parMcd = .75) exp_res <- lapply(split_up, Mahalanobis, y = iris[, 1:4]) exp_res <- as.data.frame(exp_res) head(exp_res) head(depths) for(i in 1:ncol(exp_res)) expect_equal(depths[, i], exp_res[, i]) }) test_that('printing', { rec <- recipe(Species ~ ., data = iris) %>% step_depth(all_predictors(), class = "Species", metric = "spatial") expect_output(print(rec)) expect_output(prep(rec, training = iris)) }) recipes/tests/testthat/test_stringsAsFactors.R 0000644 0001777 0001777 00000002546 13135741217 022734 0 ustar herbrandt herbrandt library(testthat) library(recipes) n <- 20 set.seed(752) as_fact <- data.frame( numbers = rnorm(n), fact = factor(sample(letters[1:3], n, replace = TRUE)), ord = factor(sample(LETTERS[22:26], n, replace = TRUE), ordered = TRUE) ) as_str <- as_fact as_str$fact <- as.character(as_str$fact) as_str$ord <- as.character(as_str$ord) test_that('stringsAsFactors = FALSE', { rec1 <- recipe(~ ., data = as_fact) %>% step_center(numbers) rec1 <- prep(rec1, training = as_fact, retain = TRUE, stringsAsFactors = FALSE, verbose = FALSE) rec1_as_fact <- bake(rec1, newdata = as_fact) rec1_as_str <- bake(rec1, newdata = as_str) expect_equal(as_fact$fact, rec1_as_fact$fact) expect_equal(as_fact$ord, rec1_as_fact$ord) expect_equal(as_str$fact, rec1_as_str$fact) expect_equal(as_str$ord, rec1_as_str$ord) }) test_that('stringsAsFactors = TRUE', { rec2 <- recipe(~ ., data = as_fact) %>% step_center(numbers) rec2 <- prep(rec2, training = as_fact, retain = TRUE, stringsAsFactors = TRUE, verbose = FALSE) rec2_as_fact <- bake(rec2, newdata = as_fact) rec2_as_str <- bake(rec2, newdata = as_str) expect_equal(as_fact$fact, rec2_as_fact$fact) expect_equal(as_fact$ord, rec2_as_fact$ord) expect_equal(as_fact$fact, rec2_as_str$fact) expect_equal(as_fact$ord, rec2_as_str$ord) }) recipes/tests/testthat/test_center_scale.R 0000644 0001777 0001777 00000004154 13135741217 022061 0 ustar herbrandt herbrandt library(testthat) context("Testing center and scale") library(recipes) means <- vapply(biomass[, 3:7], mean, c(mean = 0)) sds <- vapply(biomass[, 3:7], sd, c(sd = 0)) rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass) test_that('correct means and std devs', { standardized <- rec %>% step_center(carbon, hydrogen, oxygen, nitrogen, sulfur) %>% step_scale(carbon, hydrogen, oxygen, nitrogen, sulfur) standardized_trained <- prep(standardized, training = biomass, verbose = FALSE) expect_equal(standardized_trained$steps[[1]]$means, means) expect_equal(standardized_trained$steps[[2]]$sds, sds) }) test_that('training in stages', { at_once <- rec %>% step_center(carbon, hydrogen, oxygen, nitrogen, sulfur) %>% step_scale(carbon, hydrogen, oxygen, nitrogen, sulfur) at_once_trained <- prep(at_once, training = biomass, verbose = FALSE) ## not train in stages center_first <- rec %>% step_center(carbon, hydrogen, oxygen, nitrogen, sulfur) center_first_trained <- prep(center_first, training = biomass, verbose = FALSE) in_stages <- center_first_trained %>% step_scale(carbon, hydrogen, oxygen, nitrogen, sulfur) in_stages_trained <- prep(in_stages, training = biomass, verbose = FALSE) in_stages_retrained <- prep(in_stages, training = biomass, verbose = FALSE, fresh = TRUE) expect_equal(at_once_trained, in_stages_trained) expect_equal(at_once_trained, in_stages_retrained) }) test_that('single predictor', { standardized <- rec %>% step_center(carbon) %>% step_scale(hydrogen) standardized_trained <- prep(standardized, training = biomass, verbose = FALSE) results <- bake(standardized_trained, biomass) exp_res <- biomass[, 3:8] exp_res$carbon <- exp_res$carbon - mean(exp_res$carbon) exp_res$hydrogen <- exp_res$hydrogen / sd(exp_res$hydrogen) expect_equal(as.data.frame(results), exp_res[, colnames(results)]) }) test_that('printing', { standardized <- rec %>% step_center(carbon) %>% step_scale(hydrogen) expect_output(print(standardized)) expect_output(prep(standardized, training = biomass)) }) recipes/tests/testthat/test_holiday.R 0000644 0001777 0001777 00000004240 13135741217 021057 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(lubridate) exp_dates <- data.frame(date = ymd(c("2017-12-25", "2017-05-29", "2017-04-16")), holiday = c("ChristmasDay", "USMemorialDay", "Easter"), stringsAsFactors = FALSE) test_data <- data.frame(day = ymd("2017-01-01") + days(0:364)) test_that('Date class', { holiday_rec <- recipe(~ day, test_data) %>% step_holiday(all_predictors(), holidays = exp_dates$holiday) holiday_rec <- prep(holiday_rec, training = test_data) holiday_ind <- bake(holiday_rec, test_data) all.equal(holiday_ind$day[holiday_ind$day_USMemorialDay == 1], exp_dates$date[exp_dates$holiday == "USMemorialDay"]) expect_equal(holiday_ind$day[holiday_ind$day_USMemorialDay == 1], exp_dates$date[exp_dates$holiday == "USMemorialDay"]) expect_equal(holiday_ind$day[holiday_ind$day_ChristmasDay == 1], exp_dates$date[exp_dates$holiday == "ChristmasDay"]) expect_equal(holiday_ind$day[holiday_ind$day_Easter == 1], exp_dates$date[exp_dates$holiday == "Easter"]) }) test_that('POSIXct class', { test_data$day <- as.POSIXct(test_data$day) exp_dates$date <- as.POSIXct(exp_dates$date) holiday_rec <- recipe(~ day, test_data) %>% step_holiday(all_predictors(), holidays = exp_dates$holiday) holiday_rec <- prep(holiday_rec, training = test_data) holiday_ind <- bake(holiday_rec, test_data) all.equal(holiday_ind$day[holiday_ind$day_USMemorialDay == 1], exp_dates$date[exp_dates$holiday == "USMemorialDay"]) expect_equal(holiday_ind$day[holiday_ind$day_USMemorialDay == 1], exp_dates$date[exp_dates$holiday == "USMemorialDay"]) expect_equal(holiday_ind$day[holiday_ind$day_ChristmasDay == 1], exp_dates$date[exp_dates$holiday == "ChristmasDay"]) expect_equal(holiday_ind$day[holiday_ind$day_Easter == 1], exp_dates$date[exp_dates$holiday == "Easter"]) }) test_that('printing', { holiday_rec <- recipe(~ day, test_data) %>% step_holiday(all_predictors(), holidays = exp_dates$holiday) expect_output(print(holiday_rec)) expect_output(prep(holiday_rec, training = test_data)) }) recipes/tests/testthat/test_dummies.R 0000644 0001777 0001777 00000002345 13135741217 021075 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(okc) okc$location <- gsub(", california", "", okc$location) okc$diet[is.na(okc$diet)] <- "missing" okc <- okc[complete.cases(okc), -5] okc_fac <- data.frame(okc) test_that('dummy variables with string inputs', { rec <- recipe(age ~ ., data = okc) dummy <- rec %>% step_dummy(diet, location) dummy_trained <- prep(dummy, training = okc, verbose = FALSE, stringsAsFactors = FALSE) dummy_pred <- bake(dummy_trained, newdata = okc) dummy_pred <- dummy_pred[, order(colnames(dummy_pred))] dummy_pred <- as.data.frame(dummy_pred) rownames(dummy_pred) <- NULL exp_res <- model.matrix(age ~ ., data = okc_fac)[, -1] exp_res <- exp_res[, colnames(exp_res) != "age"] colnames(exp_res) <- gsub("^location", "location_", colnames(exp_res)) colnames(exp_res) <- gsub("^diet", "diet_", colnames(exp_res)) colnames(exp_res) <- make.names(colnames(exp_res)) exp_res <- exp_res[, order(colnames(exp_res))] exp_res <- as.data.frame(exp_res) rownames(exp_res) <- NULL expect_equal(dummy_pred, exp_res) }) test_that('printing', { rec <- recipe(age ~ ., data = okc) dummy <- rec %>% step_dummy(diet, location) expect_output(print(dummy)) expect_output(prep(dummy, training = okc)) }) recipes/tests/testthat/test_meanimpute.R 0000644 0001777 0001777 00000003272 13135741217 021576 0 ustar herbrandt herbrandt library(testthat) library(recipes) data("credit_data") set.seed(342) in_training <- sample(1:nrow(credit_data), 2000) credit_tr <- credit_data[ in_training, ] credit_te <- credit_data[-in_training, ] test_that('simple mean', { rec <- recipe(Price ~ ., data = credit_tr) impute_rec <- rec %>% step_meanimpute(Age, Assets, Income) imputed <- prep(impute_rec, training = credit_tr, verbose = FALSE) te_imputed <- bake(imputed, newdata = credit_te) expect_equal(te_imputed$Age, credit_te$Age) expect_equal(te_imputed$Assets[is.na(credit_te$Assets)], rep(mean(credit_tr$Assets, na.rm = TRUE), sum(is.na(credit_te$Assets)))) expect_equal(te_imputed$Income[is.na(credit_te$Income)], rep(mean(credit_tr$Income, na.rm = TRUE), sum(is.na(credit_te$Income)))) }) test_that('trimmed mean', { rec <- recipe(Price ~ ., data = credit_tr) impute_rec <- rec %>% step_meanimpute(Assets, trim = .1) imputed <- prep(impute_rec, training = credit_tr, verbose = FALSE) te_imputed <- bake(imputed, newdata = credit_te) expect_equal(te_imputed$Assets[is.na(credit_te$Assets)], rep(mean(credit_tr$Assets, na.rm = TRUE, trim = .1), sum(is.na(credit_te$Assets)))) }) test_that('non-numeric', { rec <- recipe(Price ~ ., data = credit_tr) impute_rec <- rec %>% step_meanimpute(Assets, Job) expect_error(prep(impute_rec, training = credit_tr, verbose = FALSE)) }) test_that('printing', { impute_rec <- recipe(Price ~ ., data = credit_tr) %>% step_meanimpute(Age, Assets, Income) expect_output(print(impute_rec)) expect_output(prep(impute_rec, training = credit_tr)) }) recipes/tests/testthat/test_ratio.R 0000644 0001777 0001777 00000004376 13135741217 020556 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) n <- 20 ex_dat <- data.frame( x1 = -1:8, x2 = 1, x3 = c(1:9, NA), x4 = 11:20, x5 = letters[1:10] ) rec <- recipe( ~ x1 + x2 + x3 + x4 + x5, data = ex_dat) test_that('1:many', { rec1 <- rec %>% step_ratio(x1, denom = denom_vars(all_numeric())) rec1 <- prep(rec1, ex_dat, verbose = FALSE) obs1 <- bake(rec1, ex_dat) res1 <- tibble( x1_o_x2 = ex_dat$x1/ex_dat$x2, x1_o_x3 = ex_dat$x1/ex_dat$x3, x1_o_x4 = ex_dat$x1/ex_dat$x4 ) for(i in names(res1)) expect_equal(res1[i], obs1[i]) }) test_that('many:1', { rec2 <- rec %>% step_ratio(all_numeric(), denom = denom_vars(x1)) rec2 <- prep(rec2, ex_dat, verbose = FALSE) obs2 <- bake(rec2, ex_dat) res2 <- tibble( x2_o_x1 = ex_dat$x2/ex_dat$x1, x3_o_x1 = ex_dat$x3/ex_dat$x1, x4_o_x1 = ex_dat$x4/ex_dat$x1 ) for(i in names(res2)) expect_equal(res2[i], obs2[i]) }) test_that('many:many', { rec3 <- rec %>% step_ratio(all_numeric(), denom = denom_vars(all_numeric())) rec3 <- prep(rec3, ex_dat, verbose = FALSE) obs3 <- bake(rec3, ex_dat) res3 <- tibble( x2_o_x1 = ex_dat$x2/ex_dat$x1, x3_o_x1 = ex_dat$x3/ex_dat$x1, x4_o_x1 = ex_dat$x4/ex_dat$x1, x1_o_x2 = ex_dat$x1/ex_dat$x2, x3_o_x2 = ex_dat$x3/ex_dat$x2, x4_o_x2 = ex_dat$x4/ex_dat$x2, x1_o_x3 = ex_dat$x1/ex_dat$x3, x2_o_x3 = ex_dat$x2/ex_dat$x3, x4_o_x3 = ex_dat$x4/ex_dat$x3, x1_o_x4 = ex_dat$x1/ex_dat$x4, x2_o_x4 = ex_dat$x2/ex_dat$x4, x3_o_x4 = ex_dat$x3/ex_dat$x4 ) for(i in names(res3)) expect_equal(res3[i], obs3[i]) }) test_that('wrong type', { rec4 <- rec %>% step_ratio(x1, denom = denom_vars(all_predictors())) expect_error(prep(rec4, ex_dat, verbose = FALSE)) rec5 <- rec %>% step_ratio(all_predictors(), denom = denom_vars(x1)) expect_error(prep(rec5, ex_dat, verbose = FALSE)) rec6 <- rec %>% step_ratio(all_predictors(), denom = denom_vars(all_predictors())) expect_error(prep(rec6, ex_dat, verbose = FALSE)) }) test_that('printing', { rec3 <- rec %>% step_ratio(all_numeric(), denom = denom_vars(all_numeric())) expect_output(print(rec3)) expect_output(prep(rec3, training = ex_dat)) }) recipes/tests/testthat/test_sqrt.R 0000644 0001777 0001777 00000001205 13135741217 020415 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) n <- 20 ex_dat <- data.frame(x1 = seq(0, 1, length = n), x2 = rep(1:5, 4)) test_that('simple sqrt trans', { rec <- recipe(~., data = ex_dat) %>% step_sqrt(x1, x2) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) exp_res <- as_tibble(lapply(ex_dat, sqrt)) expect_equal(rec_trans, exp_res) }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_sqrt(x1, x2) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_kpca.R 0000644 0001777 0001777 00000002154 13135741217 020346 0 ustar herbrandt herbrandt library(testthat) library(kernlab) library(recipes) set.seed(131) tr_dat <- matrix(rnorm(100*6), ncol = 6) te_dat <- matrix(rnorm(20*6), ncol = 6) colnames(tr_dat) <- paste0("X", 1:6) colnames(te_dat) <- paste0("X", 1:6) rec <- recipe(X1 ~ ., data = tr_dat) test_that('correct kernel PCA values', { kpca_rec <- rec %>% step_kpca(X2, X3, X4, X5, X6) kpca_trained <- prep(kpca_rec, training = tr_dat, verbose = FALSE) pca_pred <- bake(kpca_trained, newdata = te_dat) pca_pred <- as.matrix(pca_pred) pca_exp <- kpca(as.matrix(tr_dat[, -1]), kernel = kpca_rec$steps[[1]]$options$kernel, kpar = kpca_rec$steps[[1]]$options$kpar) pca_pred_exp <- kernlab::predict(pca_exp, te_dat[, -1])[, 1:kpca_trained$steps[[1]]$num] colnames(pca_pred_exp) <- paste0("kPC", 1:kpca_trained$steps[[1]]$num) rownames(pca_pred) <- NULL rownames(pca_pred_exp) <- NULL expect_equal(pca_pred, pca_pred_exp) }) test_that('printing', { kpca_rec <- rec %>% step_kpca(X2, X3, X4, X5, X6) expect_output(print(kpca_rec)) expect_output(prep(kpca_rec, training = tr_dat)) }) recipes/tests/testthat/test_ica.R 0000644 0001777 0001777 00000011672 13135741217 020171 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(biomass) biomass_tr <- biomass[biomass$dataset == "Training",] biomass_te <- biomass[biomass$dataset == "Testing",] rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass_tr) ## Generated using fastICA exp_comp <- structure( c(-0.741586750989301, -0.473165319478851, -0.532724778033598, 0.347336643017696, -0.523140911818999, 0.0839020928800183, -0.689112937865132, 1.1905359062157, 2.87389193916233, 3.87655326677861, 0.662748883270711, 0.108848159489063, 0.509384921516091, -0.708397515735095, -0.129606867389727, 1.7900565287023, 0.171125628795304, 0.314289325954585, -0.142425199147843, -0.619509248504534, 0.38690051207701, -0.414352364956822, -0.609744599991299, -0.144705519030626, -0.293470631707416, -0.791746573929697, -0.634208572824357, 1.36675934105489, -0.785855217530414, -0.730790987290872, -0.236417274868796, -0.210596011735952, -0.413793241941344, -0.511246150962085, -0.181254985021062, 0.298659496162363, -0.757969803548959, -0.666845883775384, -0.240983277334825, -0.394806974813201, 1.44451054341856, 3.33833135277739, -0.54575996404394, -0.423145023192357, -0.388925027133234, -0.418629250017466, -0.463085718807788, -0.14499128867367, 0.323243757311295, -0.417689940076107, -0.777761367811451, -0.799107717902467, -0.548346133015069, 0.769235286712577, -0.40466870434895, -0.591389964794494, -0.208052301856056, -0.945352336400244, 0.919793619211536, -0.561549525440524, -0.535789943464846, -0.735536725127484, 3.7162236121338, 0.459835444175181, 0.137984939011763, -0.755831873688866, -0.757751495230731, -0.512815283263682, 0.901123348803226, -0.755032174981781, -1.04745496967861, -0.481720409476034, -0.956534770489922, 2.39775097011864, -0.537189360991569, 0.455171520278689, -0.764070183446535, -0.0133182183358093, 0.0084094629323547, -0.11887530759164, -0.50492491720854, -0.731237740789087, -0.810056304451282, -0.0654477889270799, -0.165218457853762, -0.384457532271443, -1.25744957888255, -0.164838366701182, -0.818591960610985, -0.577844253001226, 0.159731749239493, -0.350242543749645, 3.22437340069565, -0.575271823706669, -0.171250094126726, 1.21819592885382, -0.303636775510361, 0.192247367642684, 0.235728177283036, -0.768212986589321, 0.333147682813931, -0.403932170943429, -0.261749940045069, -0.331436881499356, -0.298793661022028, -0.255788540744319, -0.764483629396313, -0.162133725599773, -0.10676549266036, -0.349722429991475, -0.340728544016434, -0.358565693266084, 0.0242508678396987, -0.277425329351928, 0.055217077863271, 0.146403703647814, -0.241268230680493, -0.283770652745491, -0.573657866580657, -0.224655195396099, 0.226079102614757, 2.03305968574443, -0.225655562941607, -0.155789455588855, -0.613828894885655, 0.480057477445702, 0.277055812270816, -0.263765068404404, 0.0411239101983566, 0.30164066516454, -0.760891669412883, -0.478609196612072, -0.162692709808673, 3.12547570195871, -0.189300748528298, -0.16882558146447, -0.30745201359965, 2.77823976198232, -0.306599455530011, -0.979722296618571, -0.913952653732135, -0.608622766593967, -0.061561169157735, 0.0134953299517241, -0.111595843415483, -0.0995809192931606, -0.353150299985198, -0.173474040260694, -0.11913118533085, -0.268152445374219, -1.64524056576117, -0.052825674116391, 2.82692828099746, -0.257823074604271, -0.0316348082448068, -0.347414676200845, -0.237534967478309, -0.266298103195764, -0.0555773569483491, 2.35155293218832), .Dim = c(80L, 2L), .Dimnames = list(c("15", "20", "26", "31", "36", "41", "46", "51", "55", "65", "69", "73", "76", "88", "91", "126", "132", "136", "141", "147", "155", "162", "167", "173", "178", "183", "190", "196", "203", "208", "213", "218", "223", "230", "235", "241", "252", "257", "262", "267", "277", "282", "286", "294", "299", "305", "309", "314", "319", "325", "330", "348", "353", "357", "359", "370", "375", "385", "399", "407", "409", "414", "419", "424", "429", "434", "439", "448", "467", "473", "477", "482", "485", "493", "499", "516", "519", "527", "532", "535" ), c("IC1", "IC2"))) rownames(exp_comp) <- NULL test_that('correct ICA values', { ica_extract <- rec %>% step_ica(carbon, hydrogen, oxygen, nitrogen, sulfur, num = 2) set.seed(12) ica_extract_trained <- prep(ica_extract, training = biomass_tr, verbose = FALSE) ica_pred <- bake(ica_extract_trained, newdata = biomass_te) ica_pred <- as.matrix(ica_pred) rownames(ica_pred) <- NULL expect_equal(ica_pred, exp_comp) }) test_that('printing', { ica_extract <- rec %>% step_ica(carbon, hydrogen, num = 2) expect_output(print(ica_extract)) expect_output(prep(ica_extract, training = biomass_tr)) }) recipes/tests/testthat/test_discretized.R 0000644 0001777 0001777 00000002214 13135741217 021736 0 ustar herbrandt herbrandt library(testthat) library(recipes) ex_tr <- data.frame(x1 = 1:100, x2 = rep(1:5, each = 20), x3 = factor(rep(letters[1:2], each = 50))) ex_te <- data.frame(x1 = c(1, 50, 101, NA)) lvls_breaks_4 <- c('bin_missing', 'bin1', 'bin2', 'bin3', 'bin4') test_that('default args', { bin_1 <- discretize(ex_tr$x1) pred_1 <- predict(bin_1, ex_te$x1) exp_1 <- factor(c("bin1", "bin2", "bin4", "bin_missing"), levels = lvls_breaks_4) expect_equal(pred_1, exp_1) }) test_that('NA values', { bin_2 <- discretize(ex_tr$x1, keep_na = FALSE) pred_2 <- predict(bin_2, ex_te$x1) exp_2 <- factor(c("bin1", "bin2", "bin4", NA), levels = lvls_breaks_4[-1]) expect_equal(pred_2, exp_2) }) test_that('NA values from out of range', { bin_3 <- discretize(ex_tr$x1, keep_na = FALSE, infs = FALSE) pred_3 <- predict(bin_3, ex_te$x1) exp_3 <- factor(c("bin1", "bin2", NA, NA), levels = lvls_breaks_4[-1]) expect_equal(pred_3, exp_3) }) test_that('printing', { rec <- recipe(~., data = ex_tr) %>% step_discretize(x1) expect_output(print(rec)) expect_output(prep(rec, training = ex_tr)) }) recipes/tests/testthat/test_bagimpute.R 0000644 0001777 0001777 00000003414 13135741217 021405 0 ustar herbrandt herbrandt library(testthat) library(ipred) library(rpart) library(recipes) data("biomass") biomass$fac <- factor(sample(letters[1:2], size = nrow(biomass), replace = TRUE)) rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur + fac, data = biomass) test_that('imputation models', { imputed <- rec %>% step_bagimpute(carbon, fac, impute_with = imp_vars(hydrogen, oxygen), seed_val = 12) imputed_trained <- prep(imputed, training = biomass, verbose = FALSE) ## make sure we get the same trees given the same random samples carb_samps <- lapply(imputed_trained$steps[[1]]$models[["carbon"]]$mtrees, function(x) x$bindx) for(i in seq_along(carb_samps)) { carb_data <- biomass[carb_samps[[i]], c("carbon", "hydrogen", "oxygen")] carb_mod <- rpart(carbon ~ ., data = carb_data, control= rpart.control(xval=0)) expect_equal(carb_mod$splits, imputed_trained$steps[[1]]$models[["carbon"]]$mtrees[[i]]$btree$splits) } fac_samps <- lapply(imputed_trained$steps[[1]]$models[[1]]$mtrees, function(x) x$bindx) fac_ctrl <- imputed_trained$steps[[1]]$models[["fac"]]$mtrees[[1]]$btree$control ## make sure we get the same trees given the same random samples for(i in seq_along(fac_samps)) { fac_data <- biomass[fac_samps[[i]], c("fac", "hydrogen", "oxygen")] fac_mod <- rpart(fac ~ ., data = fac_data, control= fac_ctrl) expect_equal(fac_mod$splits, imputed_trained$steps[[1]]$models[["fac"]]$mtrees[[i]]$btree$splits) } }) test_that('printing', { imputed <- rec %>% step_bagimpute(carbon, impute_with = imp_vars(hydrogen), seed_val = 12) expect_output(print(imputed)) expect_output(prep(imputed, training = biomass)) }) recipes/tests/testthat/test_select_terms.R 0000644 0001777 0001777 00000005160 13125050130 022103 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) library(tidyselect) library(rlang) data(okc) rec1 <- recipe(~ ., data = okc) info1 <- summary(rec1) data(biomass) rec2 <- recipe(biomass) %>% add_role(carbon, hydrogen, oxygen, nitrogen, sulfur, new_role = "predictor") %>% add_role(HHV, new_role = "outcome") %>% add_role(sample, new_role = "id variable") %>% add_role(dataset, new_role = "splitting indicator") info2 <- summary(rec2) test_that('simple role selections', { expect_equal( terms_select(info = info1, quos(all_predictors())), info1$variable ) expect_error(terms_select(info = info1, quos(all_outcomes()))) expect_equal( terms_select(info = info2, quos(all_outcomes())), "HHV" ) expect_equal( terms_select(info = info2, quos(has_role("splitting indicator"))), "dataset" ) }) test_that('simple type selections', { expect_equal( terms_select(info = info1, quos(all_numeric())), c("age", "height") ) expect_equal( terms_select(info = info1, quos(has_type("date"))), "date" ) expect_equal( terms_select(info = info1, quos(all_nominal())), c("diet", "location") ) }) test_that('simple name selections', { expect_equal( terms_select(info = info1, quos(matches("e$"))), c("age", "date") ) expect_equal( terms_select(info = info2, quos(contains("gen"))), c("hydrogen", "oxygen", "nitrogen") ) expect_equal( terms_select(info = info2, quos(contains("gen"), -nitrogen)), c("hydrogen", "oxygen") ) expect_equal( terms_select(info = info1, quos(date, age)), c("date", "age") ) ## This is weird but consistent with `dplyr::select_vars` expect_equal( terms_select(info = info1, quos(-age, date)), c("diet", "height", "location", "date") ) expect_equal( terms_select(info = info1, quos(date, -age)), "date" ) expect_error(terms_select(info = info1, quos(log(date)))) expect_error(terms_select(info = info1, quos(date:age))) expect_error(terms_select(info = info1, quos(I(date:age)))) expect_error(terms_select(info = info1, quos(matches("blahblahblah")))) expect_error(terms_select(info = info1)) }) test_that('combinations', { expect_equal( terms_select(info = info2, quos(matches("[hH]"), -all_outcomes())), "hydrogen" ) expect_equal( terms_select(info = info2, quos(all_numeric(), -all_predictors())), "HHV" ) expect_equal( terms_select(info = info2, quos(all_numeric(), -all_predictors(), dataset)), c("HHV", "dataset") ) expect_equal( terms_select(info = info2, quos(all_numeric(), -all_predictors(), dataset, -dataset)), "HHV" ) }) recipes/tests/testthat/test_isomap.R 0000644 0001777 0001777 00000002656 13135741217 020727 0 ustar herbrandt herbrandt library(testthat) library(recipes) ## expected results form the `dimRed` package exp_res <- structure(list(Isomap1 = c(0.312570873898531, 0.371885353599467, 2.23124009833741, 0.248271457498181, -0.420128801874122), Isomap2 = c(-0.443724171391742, -0.407721529759647, 0.245721022395862, 3.112001672258, 0.0292770508011519), Isomap3 = c(0.761529345514676, 0.595015565588918, 1.59943072269788, 0.566884409484389, 1.53770327701819)), .Names = c("Isomap1","Isomap2", "Isomap3"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L)) set.seed(1) dat1 <- matrix(rnorm(15), ncol = 3) dat2 <- matrix(rnorm(15), ncol = 3) colnames(dat1) <- paste0("x", 1:3) colnames(dat2) <- paste0("x", 1:3) rec <- recipe( ~ ., data = dat1) test_that('correct Isomap values', { skip_on_cran() im_rec <- rec %>% step_isomap(x1, x2, x3, options = list(knn = 3), num = 3) im_trained <- prep(im_rec, training = dat1, verbose = FALSE) im_pred <- bake(im_trained, newdata = dat2) all.equal(as.matrix(im_pred), as.matrix(exp_res)) }) test_that('printing', { im_rec <- rec %>% step_isomap(x1, x2, x3, options = list(knn = 3), num = 3) expect_output(print(im_rec)) expect_output(prep(im_rec, training = dat1)) }) recipes/tests/testthat/test_multivariate.R 0000644 0001777 0001777 00000001521 13135741217 022133 0 ustar herbrandt herbrandt library(tibble) library(recipes) data("biomass") test_that('multivariate outcome', { raw_recipe <- recipe(carbon + hydrogen ~ oxygen + nitrogen + sulfur, data = biomass) rec <- raw_recipe %>% step_center(all_outcomes()) %>% step_scale(all_predictors()) rec_trained <- prep(rec, training = biomass) results <- bake(rec_trained, head(biomass)) exp_res <- biomass pred <- c("oxygen", "nitrogen", "sulfur") outcome <- c("carbon", "hydrogen") for(i in pred) exp_res[,i] <- exp_res[,i]/sd(exp_res[,i]) for(i in outcome) exp_res[,i] <- exp_res[,i]-mean(exp_res[,i]) expect_equal(rec$term_info$variable[rec$term_info$role == "outcome"], outcome) expect_equal(rec$term_info$variable[rec$term_info$role == "predictor"], pred) expect_equal(exp_res[1:6, colnames(results)], as.data.frame(results)) }) recipes/tests/testthat/test_classdist.R 0000644 0001777 0001777 00000003006 13135741217 021416 0 ustar herbrandt herbrandt library(testthat) library(recipes) test_that("defaults", { rec <- recipe(Species ~ ., data = iris) %>% step_classdist(all_predictors(), class = "Species", log = FALSE) trained <- prep(rec, training = iris, verbose = FALSE) dists <- bake(trained, newdata = iris) dists <- dists[, grepl("classdist", names(dists))] dists <- as.data.frame(dists) split_up <- split(iris[, 1:4], iris$Species) mahalanobis2 <- function(x, y) mahalanobis(y, center = colMeans(x), cov = cov(x)) exp_res <- lapply(split_up, mahalanobis2, y = iris[, 1:4]) exp_res <- as.data.frame(exp_res) for(i in 1:ncol(exp_res)) expect_equal(dists[, i], exp_res[, i]) }) test_that("alt args", { rec <- recipe(Species ~ ., data = iris) %>% step_classdist(all_predictors(), class = "Species", log = FALSE, mean_func = median) trained <- prep(rec, training = iris, verbose = FALSE) dists <- bake(trained, newdata = iris) dists <- dists[, grepl("classdist", names(dists))] dists <- as.data.frame(dists) split_up <- split(iris[, 1:4], iris$Species) mahalanobis2 <- function(x, y) mahalanobis(y, center = apply(x, 2, median), cov = cov(x)) exp_res <- lapply(split_up, mahalanobis2, y = iris[, 1:4]) exp_res <- as.data.frame(exp_res) for(i in 1:ncol(exp_res)) expect_equal(dists[, i], exp_res[, i]) }) test_that('printing', { rec <- recipe(Species ~ ., data = iris) %>% step_classdist(all_predictors(), class = "Species", log = FALSE) expect_output(print(rec)) expect_output(prep(rec, training = iris)) }) recipes/tests/testthat/test_shuffle.R 0000644 0001777 0001777 00000003450 13135741217 021064 0 ustar herbrandt herbrandt library(testthat) library(recipes) n <- 50 set.seed(424) dat <- data.frame( x1 = sort(rnorm(n)), x2 = sort(rep(1:5, each = 10)), x3 = sort(factor(rep(letters[1:3], c(2, 2, 46)))), x4 = 1, y = sort(runif(n)) ) test_that('numeric data', { rec1 <- recipe(y ~ ., data = dat) %>% step_shuffle(all_numeric()) rec1 <- prep(rec1, training = dat, verbose = FALSE) set.seed(7046) dat1 <- bake(rec1, dat) exp1 <- c(FALSE, FALSE, TRUE, TRUE) obs1 <- rep(NA, 4) for (i in 1:ncol(dat1)) obs1[i] <- isTRUE(all.equal(dat[, i], getElement(dat1, names(dat)[i]))) expect_equal(exp1, obs1) }) test_that('nominal data', { rec2 <- recipe(y ~ ., data = dat) %>% step_shuffle(all_nominal()) rec2 <- prep(rec2, training = dat, verbose = FALSE) set.seed(804) dat2 <- bake(rec2, dat) exp2 <- c(TRUE, TRUE, FALSE, TRUE) obs2 <- rep(NA, 4) for (i in 1:ncol(dat2)) obs2[i] <- isTRUE(all.equal(dat[, i], getElement(dat2, names(dat)[i]))) expect_equal(exp2, obs2) }) test_that('all data', { rec3 <- recipe(y ~ ., data = dat) %>% step_shuffle(everything()) rec3 <- prep(rec3, training = dat, verbose = FALSE) set.seed(2516) dat3 <- bake(rec3, dat) exp3 <- c(FALSE, FALSE, FALSE, TRUE) obs3 <- rep(NA, 4) for (i in 1:ncol(dat3)) obs3[i] <- isTRUE(all.equal(dat[, i], getElement(dat3, names(dat)[i]))) expect_equal(exp3, obs3) }) test_that('printing', { rec3 <- recipe(y ~ ., data = dat) %>% step_shuffle(everything()) expect_output(print(rec3)) expect_output(prep(rec3, training = dat)) }) test_that('bake a single row', { rec4 <- recipe(y ~ ., data = dat) %>% step_shuffle(everything()) rec4 <- prep(rec4, training = dat, verbose = FALSE) expect_warning(dat4 <- bake(rec4, dat[1,], everything())) expect_equal(dat4, dat[1,]) }) recipes/tests/testthat/test-basics.R 0000644 0001777 0001777 00000003254 13135757306 020622 0 ustar herbrandt herbrandt library(testthat) context("Testing basic functionalities") library(tibble) library(recipes) data("biomass") test_that("Recipe correctly identifies output variable", { raw_recipe <- recipe(HHV ~ ., data = biomass) var_info <- raw_recipe$var_info expect_true(is.tibble(var_info)) outcome_ind <- which(var_info$variable == "HHV") expect_true(var_info$role[outcome_ind] == "outcome") expect_true(all(var_info$role[-outcome_ind] == rep("predictor", ncol(biomass) - 1))) }) test_that("Recipe fails on in-line functions", { expect_error(recipe(HHV ~ log(nitrogen), data = biomass)) expect_error(recipe(HHV ~ (.)^2, data = biomass)) expect_error(recipe(HHV ~ nitrogen + sulfur + nitrogen:sulfur, data = biomass)) expect_error(recipe(HHV ~ nitrogen^2, data = biomass)) }) test_that("return character or factor values", { raw_recipe <- recipe(HHV ~ ., data = biomass) centered <- raw_recipe %>% step_center(carbon, hydrogen, oxygen, nitrogen, sulfur) centered_char <- prep(centered, training = biomass, stringsAsFactors = FALSE, retain = TRUE) char_var <- bake(centered_char, newdata = head(biomass)) expect_equal(class(char_var$sample), "character") centered_fac <- prep(centered, training = biomass, stringsAsFactors = TRUE, retain = TRUE) fac_var <- bake(centered_fac, newdata = head(biomass)) expect_equal(class(fac_var$sample), "factor") expect_equal(levels(fac_var$sample), sort(unique(biomass$sample))) }) test_that("Using prepare", { expect_error(prepare(recipe(HHV ~ ., data = biomass), training = biomass), paste0("As of version 0.0.1.9006, used `prep` ", "instead of `prepare`")) }) recipes/tests/testthat/test_date.R 0000644 0001777 0001777 00000006444 13135741217 020353 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(lubridate) library(tibble) examples <- data.frame(Dan = ymd("2002-03-04") + days(1:10), Stefan = ymd("2006-01-13") + days(1:10)) examples$Dan <- as.POSIXct(examples$Dan) date_rec <- recipe(~ Dan + Stefan, examples) %>% step_date(all_predictors()) feats <- c("year", "doy", "week", "decimal", "semester", "quarter", "dow", "month") test_that('default option', { date_rec <- recipe(~ Dan + Stefan, examples) %>% step_date(all_predictors(), features = feats) date_rec <- prep(date_rec, training = examples) date_res <- bake(date_rec, newdata = examples) date_exp <- tibble( Dan = examples$Dan, Stefan = examples$Stefan, Dan_year = year(examples$Dan), Dan_doy = yday(examples$Dan), Dan_week = week(examples$Dan), Dan_decimal = decimal_date(examples$Dan), Dan_semester = semester(examples$Dan), Dan_quarter = quarter(examples$Dan), Dan_dow = wday(examples$Dan, label = TRUE, abbr = TRUE), Dan_month = month(examples$Dan, label = TRUE, abbr = TRUE), Stefan_year = year(examples$Stefan), Stefan_doy = yday(examples$Stefan), Stefan_week = week(examples$Stefan), Stefan_decimal = decimal_date(examples$Stefan), Stefan_semester = semester(examples$Stefan), Stefan_quarter = quarter(examples$Stefan), Stefan_dow = wday(examples$Stefan, label = TRUE, abbr = TRUE), Stefan_month = month(examples$Stefan, label = TRUE, abbr = TRUE) ) date_exp$Dan_dow <- factor(as.character(date_exp$Dan_dow), levels = levels(date_exp$Dan_dow)) date_exp$Dan_month <- factor(as.character(date_exp$Dan_month), levels = levels(date_exp$Dan_month)) date_exp$Stefan_dow <- factor(as.character(date_exp$Stefan_dow), levels = levels(date_exp$Stefan_dow)) date_exp$Stefan_month <- factor(as.character(date_exp$Stefan_month), levels = levels(date_exp$Stefan_month)) expect_equal(date_res, date_exp) }) test_that('nondefault options', { date_rec <- recipe(~ Dan + Stefan, examples) %>% step_date(all_predictors(), features = c("dow", "month"), label = FALSE) date_rec <- prep(date_rec, training = examples) date_res <- bake(date_rec, newdata = examples) date_exp <- tibble( Dan = examples$Dan, Stefan = examples$Stefan, Dan_dow = wday(examples$Dan, label = FALSE), Dan_month = month(examples$Dan, label = FALSE), Stefan_dow = wday(examples$Stefan, label = FALSE), Stefan_month = month(examples$Stefan, label = FALSE) ) expect_equal(date_res, date_exp) }) test_that('ordinal values', { date_rec <- recipe(~ Dan + Stefan, examples) %>% step_date(all_predictors(), features = c("dow", "month"), ordinal = TRUE) date_rec <- prep(date_rec, training = examples) date_res <- bake(date_rec, newdata = examples) date_exp <- tibble( Dan = examples$Dan, Stefan = examples$Stefan, Dan_dow = wday(examples$Dan, label = TRUE), Dan_month = month(examples$Dan, label = TRUE), Stefan_dow = wday(examples$Stefan, label = TRUE), Stefan_month = month(examples$Stefan, label = TRUE) ) expect_equal(date_res, date_exp) }) test_that('printing', { date_rec <- recipe(~ Dan + Stefan, examples) %>% step_date(all_predictors(), features = feats) expect_output(print(date_rec)) expect_output(prep(date_rec, training = examples)) }) recipes/tests/testthat/test_roles.R 0000644 0001777 0001777 00000002277 13125447176 020570 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) data(biomass) test_that('default method', { rec <- recipe(x = biomass) exp_res <- tibble(variable = colnames(biomass), type = rep(c("nominal", "numeric"), c(2, 6)), role = NA, source = "original") expect_equal(summary(rec, TRUE), exp_res) }) test_that('changing roles', { rec <- recipe(x = biomass) rec <- add_role(rec, sample, new_role = "some other role") exp_res <- tibble(variable = colnames(biomass), type = rep(c("nominal", "numeric"), c(2, 6)), role = rep(c("some other role", NA), c(1, 7)), source = "original") expect_equal(summary(rec, TRUE), exp_res) }) test_that('change existing role', { rec <- recipe(x = biomass) rec <- add_role(rec, sample, new_role = "some other role") rec <- add_role(rec, sample, new_role = "other other role") exp_res <- tibble(variable = colnames(biomass), type = rep(c("nominal", "numeric"), c(2, 6)), role = rep(c("other other role", NA), c(1, 7)), source = "original") expect_equal(summary(rec, TRUE), exp_res) }) recipes/tests/testthat/test_intercept.R 0000644 0001777 0001777 00000002764 13135741217 021434 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) ex_dat <- data.frame(cat = rep(c("A", "B"), each = 5), numer = 1:10) test_that('add appropriate column with default settings', { rec <- recipe(~ ., data = ex_dat) %>% step_intercept() rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) exp_res <- tibble::add_column(ex_dat, "intercept" = 1, .before = TRUE) expect_equal(rec_trans, exp_res) }) test_that('adds arbitrary numeric column', { rec <- recipe(~ ., data = ex_dat) %>% step_intercept(name = "(Intercept)", value = 2.5) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) exp_res <- tibble::add_column(ex_dat, "(Intercept)" = 2.5, .before = TRUE) expect_equal(rec_trans, exp_res) }) test_that('deals with bad input', { expect_error( recipe(~ ., data = ex_dat) %>% step_intercept(value = "Pie") %>% prep(), "Intercept value must be numeric." ) expect_error( recipe(~ ., data = ex_dat) %>% step_intercept(name = 4) %>% prep(), "Intercept/constant column name must be a character value." ) expect_warning( recipe(~ ., data = ex_dat) %>% step_intercept(all_predictors()) %>% prep(), "Selectors are not used for this step." ) }) test_that('printing', { rec <- recipe(~ ., data = ex_dat) %>% step_intercept() expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_nzv.R 0000644 0001777 0001777 00000003074 13135741217 020247 0 ustar herbrandt herbrandt library(testthat) library(recipes) n <- 50 set.seed(424) dat <- data.frame(x1 = rnorm(n), x2 = rep(1:5, each = 10), x3 = factor(rep(letters[1:3], c(2, 2, 46))), x4 = 1, y = runif(n)) ratios <- function(x) { tab <- sort(table(x), decreasing = TRUE) if(length(tab) > 1) tab[1]/tab[2] else Inf } pct_uni <- vapply(dat[, -5], function(x) length(unique(x)), c(val = 0))/nrow(dat)*100 f_ratio <- vapply(dat[, -5], ratios, c(val = 0)) vars <- names(pct_uni) test_that('nzv filtering', { rec <- recipe(y ~ ., data = dat) filtering <- rec %>% step_nzv(x1, x2, x3, x4) filtering_trained <- prep(filtering, training = dat, verbose = FALSE) removed <- vars[ pct_uni <= filtering_trained$steps[[1]]$options$unique_cut & f_ratio >= filtering_trained$steps[[1]]$options$freq_cut] expect_equal(filtering_trained$steps[[1]]$removals, removed) }) test_that('altered options', { rec <- recipe(y ~ ., data = dat) filtering <- rec %>% step_nzv(x1, x2, x3, x4, options = list(freq_cut = 50, unique_cut = 10)) filtering_trained <- prep(filtering, training = dat, verbose = FALSE) removed <- vars[ pct_uni <= filtering_trained$steps[[1]]$options$unique_cut & f_ratio >= filtering_trained$steps[[1]]$options$freq_cut] expect_equal(filtering_trained$steps[[1]]$removals, removed) }) test_that('printing', { rec <- recipe(y ~ ., data = dat) %>% step_nzv(x1, x2, x3, x4) expect_output(print(rec)) expect_output(prep(rec, training = dat)) }) recipes/tests/testthat/test_logit.R 0000644 0001777 0001777 00000001453 13135741217 020547 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) n <- 20 set.seed(12) ex_dat <- data.frame(x1 = runif(n), x2 = rnorm(n)) test_that('simple logit trans', { rec <- recipe(~., data = ex_dat) %>% step_logit(x1) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) exp_res <- as_tibble(ex_dat) exp_res$x1 <- binomial()$linkfun(exp_res$x1) expect_equal(rec_trans, exp_res) }) test_that('out of bounds logit trans', { rec <- recipe(~., data = ex_dat) %>% step_logit(x1, x2) expect_error(prep(rec, training = ex_dat, verbose = FALSE)) }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_logit(x1) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_corr.R 0000644 0001777 0001777 00000001757 13135741217 020405 0 ustar herbrandt herbrandt library(testthat) library(recipes) n <- 100 set.seed(424) dat <- matrix(rnorm(n*5), ncol = 5) dat <- as.data.frame(dat) dat$duplicate <- dat$V1 dat$V6 <- -dat$V2 + runif(n)*.2 test_that('high filter', { set.seed(1) rec <- recipe(~ ., data = dat) filtering <- rec %>% step_corr(all_predictors(), threshold = .5) filtering_trained <- prep(filtering, training = dat, verbose = FALSE) removed <- c("V6", "V1") expect_equal(filtering_trained$steps[[1]]$removals, removed) }) test_that('low filter', { rec <- recipe(~ ., data = dat) filtering <- rec %>% step_corr(all_predictors(), threshold = 1) filtering_trained <- prep(filtering, training = dat, verbose = FALSE) expect_equal(filtering_trained$steps[[1]]$removals, numeric(0)) }) test_that('printing', { set.seed(1) rec <- recipe(~ ., data = dat) filtering <- rec %>% step_corr(all_predictors(), threshold = .5) expect_output(print(filtering)) expect_output(prep(filtering, training = dat)) }) recipes/tests/testthat/test_retraining.R 0000644 0001777 0001777 00000001742 13135741217 021574 0 ustar herbrandt herbrandt context("Testing retraining") data(biomass) rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass) test_that('training in stages', { skip_on_cran() at_once <- rec %>% step_center(carbon, hydrogen, oxygen, nitrogen, sulfur) %>% step_scale(carbon, hydrogen, oxygen, nitrogen, sulfur) at_once_trained <- prep(at_once, training = biomass, verbose = FALSE) ## not train in stages center_first <- rec %>% step_center(carbon, hydrogen, oxygen, nitrogen, sulfur) center_first_trained <- prep(center_first, training = biomass, verbose = FALSE) in_stages <- center_first_trained %>% step_scale(carbon, hydrogen, oxygen, nitrogen, sulfur) in_stages_trained <- prep(in_stages, training = biomass, verbose = FALSE) in_stages_retrained <- prep(in_stages, training = biomass, verbose = FALSE, fresh = TRUE) expect_equal(at_once_trained, in_stages_trained) expect_equal(at_once_trained, in_stages_retrained) }) recipes/tests/testthat/test_roll.R 0000644 0001777 0001777 00000005042 13135741217 020377 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) set.seed(5522) sim_dat <- data.frame(x1 = (20:100) / 10) n <- nrow(sim_dat) sim_dat$y1 <- sin(sim_dat$x1) + rnorm(n, sd = 0.1) sim_dat$y2 <- cos(sim_dat$x1) + rnorm(n, sd = 0.1) sim_dat$x2 <- runif(n) sim_dat$x3 <- rnorm(n) sim_dat$fac <- sample(letters[1:3], size = n, replace = TRUE) rec <- recipe( ~ ., data = sim_dat) test_that('error checks', { expect_error(rec %>% step_window()) expect_error(rec %>% step_window(y1, size = 6)) expect_error(rec %>% step_window(y1, size = NA)) expect_error(rec %>% step_window(y1, size = NULL)) expect_error(rec %>% step_window(y1, statistic = "average")) expect_error(rec %>% step_window(y1, size = 1)) expect_error(rec %>% step_window(y1, size = 2)) expect_error(rec %>% step_window(y1, size = -1)) expect_warning(rec %>% step_window(y1, size = pi)) expect_error(prep(rec %>% step_window(fac), training = sim_dat)) expect_error(prep(rec %>% step_window(y1, size = 1000L), training = sim_dat)) bad_names <- rec %>% step_window(starts_with("y"), names = "only_one_name") expect_error(prep(bad_names, training = sim_dat)) }) test_that('basic moving average', { simple_ma <- rec %>% step_window(starts_with("y")) simple_ma <- prep(simple_ma, training = sim_dat) simple_ma_res <- bake(simple_ma, newdata = sim_dat) expect_equal(names(sim_dat), names(simple_ma_res)) for (i in 2:(n - 1)) { expect_equal(simple_ma_res$y1[i], mean(sim_dat$y1[(i - 1):(i + 1)])) expect_equal(simple_ma_res$y2[i], mean(sim_dat$y2[(i - 1):(i + 1)])) } expect_equal(simple_ma_res$y1[1], mean(sim_dat$y1[1:3])) expect_equal(simple_ma_res$y2[1], mean(sim_dat$y2[1:3])) expect_equal(simple_ma_res$y1[n], mean(sim_dat$y1[(n - 2):n])) expect_equal(simple_ma_res$y2[n], mean(sim_dat$y2[(n - 2):n])) }) test_that('creating new variables', { new_names <- rec %>% step_window(starts_with("y"), names = paste0("new", 1:2), role = "predictor") new_names <- prep(new_names, training = sim_dat) new_names_res <- bake(new_names, newdata = sim_dat) simple_ma <- rec %>% step_window(starts_with("y")) simple_ma <- prep(simple_ma, training = sim_dat) simple_ma_res <- bake(simple_ma, newdata = sim_dat) expect_equal(new_names_res$new1, simple_ma_res$y1) expect_equal(new_names_res$new2, simple_ma_res$y2) }) test_that('printing', { new_names <- rec %>% step_window(starts_with("y"), names = paste0("new", 1:2), role = "predictor") expect_output(print(new_names)) expect_output(prep(new_names, training = sim_dat)) }) recipes/tests/testthat/test_log.R 0000644 0001777 0001777 00000002034 13135741217 020206 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) n <- 20 set.seed(1) ex_dat <- data.frame(x1 = exp(rnorm(n, mean = .1)), x2 = 1/abs(rnorm(n)), x3 = rep(1:2, each = n/2), x4 = rexp(n)) test_that('simple log trans', { rec <- recipe(~., data = ex_dat) %>% step_log(x1, x2, x3, x4) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) exp_res <- as_tibble(lapply(ex_dat, log)) expect_equal(rec_trans, exp_res) }) test_that('alt base', { rec <- recipe(~., data = ex_dat) %>% step_log(x1, x2, x3, x4, base = pi) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) exp_res <- as_tibble(lapply(ex_dat, log, base = pi)) expect_equal(rec_trans, exp_res) }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_log(x1, x2, x3, x4) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_pca.R 0000644 0001777 0001777 00000004710 13135741217 020173 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(biomass) biomass_tr <- biomass[biomass$dataset == "Training",] biomass_te <- biomass[biomass$dataset == "Testing",] rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass_tr) test_that('correct PCA values', { pca_extract <- rec %>% step_center(carbon, hydrogen, oxygen ,nitrogen, sulfur) %>% step_scale(carbon, hydrogen, oxygen ,nitrogen, sulfur) %>% step_pca(carbon, hydrogen, oxygen, nitrogen, sulfur, options = list(retx = TRUE)) pca_extract_trained <- prep(pca_extract, training = biomass_tr, verbose = FALSE) pca_pred <- bake(pca_extract_trained, newdata = biomass_te) pca_pred <- as.matrix(pca_pred) pca_exp <- prcomp(biomass_tr[, 3:7], center = TRUE, scale. = TRUE, retx = TRUE) pca_pred_exp <- predict(pca_exp, biomass_te[, 3:7])[, 1:pca_extract$steps[[3]]$num] rownames(pca_pred) <- NULL rownames(pca_pred_exp) <- NULL expect_equal(pca_pred, pca_pred_exp) }) test_that('correct PCA values with threshold', { pca_extract <- rec %>% step_center(carbon, hydrogen, oxygen ,nitrogen, sulfur) %>% step_scale(carbon, hydrogen, oxygen ,nitrogen, sulfur) %>% step_pca(carbon, hydrogen, oxygen, nitrogen, sulfur, threshold = .5) pca_extract_trained <- prep(pca_extract, training = biomass_tr, verbose = FALSE) pca_exp <- prcomp(biomass_tr[, 3:7], center = TRUE, scale. = TRUE, retx = TRUE) # cumsum(pca_exp$sdev^2)/sum(pca_exp$sdev^2) expect_equal(pca_extract_trained$steps[[3]]$num, 2) }) test_that('Reduced rotation size', { pca_extract <- rec %>% step_center(carbon, hydrogen, oxygen ,nitrogen, sulfur) %>% step_scale(carbon, hydrogen, oxygen ,nitrogen, sulfur) %>% step_pca(carbon, hydrogen, oxygen, nitrogen, sulfur, num = 3) pca_extract_trained <- prep(pca_extract, training = biomass_tr, verbose = FALSE) pca_pred <- bake(pca_extract_trained, newdata = biomass_te) pca_pred <- as.matrix(pca_pred) pca_exp <- prcomp(biomass_tr[, 3:7], center = TRUE, scale. = TRUE, retx = TRUE) pca_pred_exp <- predict(pca_exp, biomass_te[, 3:7])[, 1:3] rownames(pca_pred_exp) <- NULL rownames(pca_pred) <- NULL rownames(pca_pred_exp) <- NULL expect_equal(pca_pred, pca_pred_exp) }) test_that('printing', { pca_extract <- rec %>% step_pca(carbon, hydrogen, oxygen, nitrogen, sulfur) expect_output(print(pca_extract)) expect_output(prep(pca_extract, training = biomass_tr)) }) recipes/tests/testthat/test_knnimpute.R 0000644 0001777 0001777 00000003615 13135741217 021445 0 ustar herbrandt herbrandt library(testthat) library(gower) library(recipes) library(dplyr) data("biomass") rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur, data = biomass) biomass_tr <- biomass[biomass$dataset == "Training", ] biomass_te <- biomass[biomass$dataset == "Testing", ] # induce some missing data at random set.seed(9039) carb_missing <- sample(1:nrow(biomass_te), 3) nitro_missing <- sample(1:nrow(biomass_te), 3) biomass_te$carbon[carb_missing] <- NA biomass_te$nitrogen[nitro_missing] <- NA test_that('imputation values', { discr_rec <- rec %>% step_discretize(nitrogen, options = list(keep_na = FALSE)) impute_rec <- discr_rec %>% step_knnimpute(carbon, nitrogen, impute_with = imp_vars(hydrogen, oxygen, nitrogen), K = 3) discr_rec <- prep(discr_rec, training = biomass_tr, verbose = FALSE) tr_data <- bake(discr_rec, newdata = biomass_tr) te_data <- bake(discr_rec, newdata = biomass_te) %>% dplyr::select(hydrogen, oxygen, nitrogen, carbon) nn <- gower_topn(te_data[, c("hydrogen", "oxygen", "nitrogen")], tr_data[, c("hydrogen", "oxygen", "nitrogen")], n = 3)$index impute_rec <- prep(impute_rec, training = biomass_tr, verbose = FALSE) imputed_te <- bake(impute_rec, newdata = biomass_te) for(i in carb_missing) { nn_tr_ind <- nn[, i] nn_tr_data <- tr_data$carbon[nn_tr_ind] expect_equal(imputed_te$carbon[i], mean(nn_tr_data)) } for(i in nitro_missing) { nn_tr_ind <- nn[, i] nn_tr_data <- tr_data$nitrogen[nn_tr_ind] expect_equal(as.character(imputed_te$nitrogen[i]), recipes:::mode_est(nn_tr_data)) } }) test_that('printing', { discr_rec <- rec %>% step_discretize(nitrogen, options = list(keep_na = FALSE)) expect_output(print(discr_rec)) expect_output(prep(discr_rec, training = biomass_tr)) }) recipes/tests/testthat/test_bin2factor.R 0000644 0001777 0001777 00000002157 13135741217 021464 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(covers) rec <- recipe(~ description, covers) %>% step_regex(description, pattern = "(rock|stony)", result = "rocks") %>% step_regex(description, pattern = "(rock|stony)", result = "more_rocks") test_that('default options', { rec1 <- rec %>% step_bin2factor(rocks) rec1 <- prep(rec1, training = covers) res1 <- bake(rec1, newdata = covers) expect_true(all(diag(table(res1$rocks, res1$more_rocks)) == 0)) }) test_that('nondefault options', { rec2 <- rec %>% step_bin2factor(rocks, levels = letters[2:1]) rec2 <- prep(rec2, training = covers) res2 <- bake(rec2, newdata = covers) expect_true(all(diag(table(res2$rocks, res2$more_rocks)) == 0)) }) test_that('bad options', { rec3 <- rec %>% step_bin2factor(description) expect_error(prep(rec3, training = covers)) expect_error(rec %>% step_bin2factor(rocks, levels = letters[1:5])) expect_error(rec %>% step_bin2factor(rocks, levels = 1:2)) }) test_that('printing', { rec2 <- rec %>% step_bin2factor(rocks, levels = letters[2:1]) expect_output(print(rec2)) expect_output(prep(rec2, training = covers)) }) recipes/tests/testthat/test_range.R 0000644 0001777 0001777 00000006070 13135741217 020525 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(biomass) biomass_tr <- biomass[1:10,] biomass_te <- biomass[c(13:14, 19, 522),] rec <- recipe(HHV ~ carbon + hydrogen, data = biomass_tr) test_that('correct values', { standardized <- rec %>% step_range(carbon, hydrogen, min = -12) standardized_trained <- prep(standardized, training = biomass_tr, verbose = FALSE) obs_pred <- bake(standardized_trained, newdata = biomass_te) obs_pred <- as.matrix(obs_pred) mins <- apply(biomass_tr[, c("carbon", "hydrogen")], 2, min) maxs <- apply(biomass_tr[, c("carbon", "hydrogen")], 2, max) new_min <- -12 new_max <- 1 new_range <- new_max - new_min carb <- ((new_range * (biomass_te$carbon - mins["carbon"])) / (maxs["carbon"] - mins["carbon"])) + new_min carb <- ifelse(carb > new_max, new_max, carb) carb <- ifelse(carb < new_min, new_min, carb) hydro <- ((new_range * (biomass_te$hydrogen - mins["hydrogen"])) / (maxs["hydrogen"] - mins["hydrogen"])) + new_min hydro <- ifelse(hydro > new_max, new_max, hydro) hydro <- ifelse(hydro < new_min, new_min, hydro) exp_pred <- cbind(carb, hydro) colnames(exp_pred) <- c("carbon", "hydrogen") expect_equal(exp_pred, obs_pred) }) test_that('defaults', { standardized <- rec %>% step_range(carbon, hydrogen) standardized_trained <- prep(standardized, training = biomass_tr, verbose = FALSE) obs_pred <- bake(standardized_trained, newdata = biomass_te) obs_pred <- as.matrix(obs_pred) mins <- apply(biomass_tr[, c("carbon", "hydrogen")], 2, min) maxs <- apply(biomass_tr[, c("carbon", "hydrogen")], 2, max) new_min <- 0 new_max <- 1 new_range <- new_max - new_min carb <- ((new_range * (biomass_te$carbon - mins["carbon"])) / (maxs["carbon"] - mins["carbon"])) + new_min carb <- ifelse(carb > new_max, new_max, carb) carb <- ifelse(carb < new_min, new_min, carb) hydro <- ((new_range * (biomass_te$hydrogen - mins["hydrogen"])) / (maxs["hydrogen"] - mins["hydrogen"])) + new_min hydro <- ifelse(hydro > new_max, new_max, hydro) hydro <- ifelse(hydro < new_min, new_min, hydro) exp_pred <- cbind(carb, hydro) colnames(exp_pred) <- c("carbon", "hydrogen") expect_equal(exp_pred, obs_pred) }) test_that('one variable', { standardized <- rec %>% step_range(carbon) standardized_trained <- prep(standardized, training = biomass_tr, verbose = FALSE) obs_pred <- bake(standardized_trained, newdata = biomass_te) mins <- min(biomass_tr$carbon) maxs <- max(biomass_tr$carbon) new_min <- 0 new_max <- 1 new_range <- new_max - new_min carb <- ((new_range * (biomass_te$carbon - mins)) / (maxs - mins)) + new_min carb <- ifelse(carb > new_max, new_max, carb) carb <- ifelse(carb < new_min, new_min, carb) expect_equal(carb, obs_pred$carbon) }) test_that('printing', { standardized <- rec %>% step_range(carbon, hydrogen, min = -12) expect_output(print(standardized)) expect_output(prep(standardized, training = biomass_tr)) }) recipes/tests/testthat/test_BoxCox.R 0000644 0001777 0001777 00000005763 13135741217 020643 0 ustar herbrandt herbrandt library(testthat) library(recipes) n <- 20 set.seed(1) ex_dat <- data.frame(x1 = exp(rnorm(n, mean = .1)), x2 = 1/rnorm(n), x3 = rep(1:2, each = n/2), x4 = rexp(n)) ## from `car` package exp_lambda <- c(x1 = 0.2874304685, x2 = NA, x3 = NA, x4 = 0.06115365314) exp_dat <- structure(list(x1 = c(-0.48855792533959, 0.295526451871788, -0.66306066037752, 2.18444062220084, 0.45714544418559, -0.650762952308473, 0.639934327981261, 0.94795174900382, 0.745877376631664, -0.199443408020842, 2.05013184840922, 0.526004196848377, -0.484073411411316, -1.5846209165316, 1.46827089088108, 0.0555044880684726, 0.0848273579417863, 1.21733702306844, 1.05470177834901, 0.76793945044649), x2 = c(1.0881660755694, 1.27854953038913, 13.4111208085756, -0.502676325196487, 1.61335666257264, -17.8161848705567, -6.41867035287092, -0.679924106156326, -2.09139367300257, 2.39267901359744, 0.736008721758276, -9.72878791903891, 2.57950278065913, -18.5856192870844, -0.726185004156987, -2.40967012205861, -2.5362046143702, -16.8595975858421, 0.909069940992826, 1.31031417340121), x3 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), x4 = c(-0.0299153493217198, -0.00545480495048682, -0.605467890118739, 0.771791879612809, -0.763649380406524, 0.872804671752781, 1.38894407918253, -0.537364454265797, -0.482864603899052, -0.0227886234018179, -1.25797709152009, -0.995703197045091, 0.102163556869708, -0.246753343931442, -1.7395729395129, 0.104247324965852, -1.15077903230011, 0.48306309307708, 1.99265865015763, -0.747338829803379)), .Names = c("x1", "x2", "x3", "x4"), row.names = c(NA, -20L), class = "data.frame") test_that('simple Box Cox', { rec <- recipe(~., data = ex_dat) %>% step_BoxCox(x1, x2, x3, x4) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) expect_equal(names(exp_lambda)[!is.na(exp_lambda)], names(rec_trained$steps[[1]]$lambdas)) expect_equal(exp_lambda[!is.na(exp_lambda)], rec_trained$steps[[1]]$lambdas, tol = .001) expect_equal(as.matrix(exp_dat), as.matrix(rec_trans), tol = .05) }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_BoxCox(x1, x2, x3, x4) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_invlogit.R 0000644 0001777 0001777 00000001202 13135741217 021254 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) n <- 20 set.seed(12) ex_dat <- data.frame(x1 = rnorm(n), x2 = runif(n)) test_that('simple logit trans', { rec <- recipe(~., data = ex_dat) %>% step_invlogit(x1) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) exp_res <- as_tibble(ex_dat) exp_res$x1 <- binomial()$linkinv(exp_res$x1) expect_equal(rec_trans, exp_res) }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_invlogit(x1) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_regex.R 0000644 0001777 0001777 00000002603 13135741217 020541 0 ustar herbrandt herbrandt library(testthat) library(recipes) data(covers) covers$rows <- 1:nrow(covers) covers$ch_rows <- paste(1:nrow(covers)) rec <- recipe(~ description + rows + ch_rows, covers) test_that('default options', { rec1 <- rec %>% step_regex(description, pattern = "(rock|stony)") %>% step_regex(description, result = "all ones") rec1 <- prep(rec1, training = covers) res1 <- bake(rec1, newdata = covers) expect_equal(res1$X.rock.stony., as.numeric(grepl("(rock|stony)", covers$description))) expect_equal(res1$`all ones`, rep(1, nrow(covers))) }) test_that('nondefault options', { rec2 <- rec %>% step_regex(description, pattern = "(rock|stony)", result = "rocks", options = list(fixed = TRUE)) rec2 <- prep(rec2, training = covers) res2 <- bake(rec2, newdata = covers) expect_equal(res2$rocks, rep(0, nrow(covers))) }) test_that('bad selector(s)', { expect_error(rec %>% step_regex(description, rows, pattern = "(rock|stony)")) rec3 <- rec %>% step_regex(starts_with("b"), pattern = "(rock|stony)") expect_error(prep(rec3, training = covers)) rec4 <- rec %>% step_regex(rows, pattern = "(rock|stony)") expect_error(prep(rec4, training = covers)) }) test_that('printing', { rec1 <- rec %>% step_regex(description, pattern = "(rock|stony)") expect_output(print(rec1)) expect_output(prep(rec1, training = covers)) }) recipes/tests/testthat/test_hyperbolic.R 0000644 0001777 0001777 00000001730 13135741217 021567 0 ustar herbrandt herbrandt library(testthat) library(recipes) library(tibble) n <- 20 ex_dat <- data.frame(x1 = seq(0, 1, length = n), x2 = seq(1, 0, length = n)) get_exp <- function(x, f) as_tibble(lapply(x, f)) test_that('simple hyperbolic trans', { for(func in c("sin", "cos", "tan")) { for(invf in c(TRUE, FALSE)) { rec <- recipe(~., data = ex_dat) %>% step_hyperbolic(x1, x2, func = func, inverse = invf) rec_trained <- prep(rec, training = ex_dat, verbose = FALSE) rec_trans <- bake(rec_trained, newdata = ex_dat) if(invf) { foo <- get(paste0("a", func)) } else { foo <- get(func) } exp_res <- get_exp(ex_dat, foo) expect_equal(rec_trans, exp_res) } } }) test_that('printing', { rec <- recipe(~., data = ex_dat) %>% step_hyperbolic(x1, x2, func = "sin", inverse = TRUE) expect_output(print(rec)) expect_output(prep(rec, training = ex_dat)) }) recipes/tests/testthat/test_modeimpute.R 0000644 0001777 0001777 00000002674 13135741217 021607 0 ustar herbrandt herbrandt library(testthat) library(recipes) data("credit_data") set.seed(342) in_training <- sample(1:nrow(credit_data), 2000) credit_tr <- credit_data[ in_training, ] credit_te <- credit_data[-in_training, ] test_that('simple modes', { rec <- recipe(Price ~ ., data = credit_tr) impute_rec <- rec %>% step_modeimpute(Status, Home, Marital) imputed <- prep(impute_rec, training = credit_tr, verbose = FALSE) te_imputed <- bake(imputed, newdata = credit_te) expect_equal(te_imputed$Status, credit_te$Status) home_exp <- rep(recipes:::mode_est(credit_tr$Home), sum(is.na(credit_te$Home))) home_exp <- factor(home_exp, levels = levels(credit_te$Home)) expect_equal(te_imputed$Home[is.na(credit_te$Home)], home_exp) marital_exp <- rep(recipes:::mode_est(credit_tr$Marital), sum(is.na(credit_te$Marital))) marital_exp <- factor(marital_exp, levels = levels(credit_te$Marital)) expect_equal(te_imputed$Marital[is.na(credit_te$Marital)], marital_exp) }) test_that('non-nominal', { rec <- recipe(Price ~ ., data = credit_tr) impute_rec <- rec %>% step_modeimpute(Assets, Job) expect_error(prep(impute_rec, training = credit_tr, verbose = FALSE)) }) test_that('printing', { impute_rec <- recipe(Price ~ ., data = credit_tr) %>% step_modeimpute(Status, Home, Marital) expect_output(print(impute_rec)) expect_output(prep(impute_rec, training = credit_tr)) }) recipes/NAMESPACE 0000644 0001777 0001777 00000013470 13135756764 014503 0 ustar herbrandt herbrandt # Generated by roxygen2: do not edit by hand S3method(bake,recipe) S3method(bake,step_BoxCox) S3method(bake,step_YeoJohnson) S3method(bake,step_bagimpute) S3method(bake,step_classdist) S3method(bake,step_corr) S3method(bake,step_date) S3method(bake,step_depth) S3method(bake,step_discretize) S3method(bake,step_dummy) S3method(bake,step_holiday) S3method(bake,step_hyperbolic) S3method(bake,step_ica) S3method(bake,step_interact) S3method(bake,step_invlogit) S3method(bake,step_isomap) S3method(bake,step_knnimpute) S3method(bake,step_kpca) S3method(bake,step_lincomb) S3method(bake,step_log) S3method(bake,step_logit) S3method(bake,step_meanimpute) S3method(bake,step_modeimpute) S3method(bake,step_ns) S3method(bake,step_nzv) S3method(bake,step_ordinalscore) S3method(bake,step_other) S3method(bake,step_pca) S3method(bake,step_poly) S3method(bake,step_range) S3method(bake,step_ratio) S3method(bake,step_rm) S3method(bake,step_scale) S3method(bake,step_shuffle) S3method(bake,step_spatialsign) S3method(bake,step_sqrt) S3method(bake,step_window) S3method(discretize,numeric) S3method(predict,discretize) S3method(prep,recipe) S3method(prep,step_BoxCox) S3method(prep,step_YeoJohnson) S3method(prep,step_bagimpute) S3method(prep,step_bin2factor) S3method(prep,step_classdist) S3method(prep,step_corr) S3method(prep,step_date) S3method(prep,step_depth) S3method(prep,step_discretize) S3method(prep,step_dummy) S3method(prep,step_holiday) S3method(prep,step_hyperbolic) S3method(prep,step_ica) S3method(prep,step_interact) S3method(prep,step_invlogit) S3method(prep,step_isomap) S3method(prep,step_knnimpute) S3method(prep,step_kpca) S3method(prep,step_lincomb) S3method(prep,step_log) S3method(prep,step_logit) S3method(prep,step_meanimpute) S3method(prep,step_modeimpute) S3method(prep,step_ns) S3method(prep,step_nzv) S3method(prep,step_ordinalscore) S3method(prep,step_other) S3method(prep,step_pca) S3method(prep,step_poly) S3method(prep,step_range) S3method(prep,step_ratio) S3method(prep,step_regex) S3method(prep,step_rm) S3method(prep,step_scale) S3method(prep,step_shuffle) S3method(prep,step_spatialsign) S3method(prep,step_sqrt) S3method(prep,step_window) S3method(print,discretize) S3method(print,recipe) S3method(recipe,data.frame) S3method(recipe,default) S3method(recipe,formula) S3method(recipe,matrix) S3method(summary,recipe) export("%>%") export(add_role) export(add_step) export(all_nominal) export(all_numeric) export(all_outcomes) export(all_predictors) export(bake) export(current_info) export(denom_vars) export(discretize) export(estimate_yj) export(has_role) export(has_type) export(imp_vars) export(juice) export(names0) export(prep) export(prepare) export(recipe) export(step) export(step_BoxCox) export(step_YeoJohnson) export(step_bagimpute) export(step_bin2factor) export(step_center) export(step_classdist) export(step_corr) export(step_date) export(step_depth) export(step_discretize) export(step_dummy) export(step_holiday) export(step_hyperbolic) export(step_ica) export(step_interact) export(step_intercept) export(step_invlogit) export(step_isomap) export(step_knnimpute) export(step_kpca) export(step_lincomb) export(step_log) export(step_logit) export(step_meanimpute) export(step_modeimpute) export(step_ns) export(step_nzv) export(step_ordinalscore) export(step_other) export(step_pca) export(step_poly) export(step_range) export(step_ratio) export(step_regex) export(step_rm) export(step_scale) export(step_shuffle) export(step_spatialsign) export(step_sqrt) export(step_window) export(terms_select) export(yj_trans) import(rlang) import(timeDate) importFrom(RcppRoll,roll_max) importFrom(RcppRoll,roll_maxl) importFrom(RcppRoll,roll_maxr) importFrom(RcppRoll,roll_mean) importFrom(RcppRoll,roll_meanl) importFrom(RcppRoll,roll_meanr) importFrom(RcppRoll,roll_median) importFrom(RcppRoll,roll_medianl) importFrom(RcppRoll,roll_medianr) importFrom(RcppRoll,roll_min) importFrom(RcppRoll,roll_minl) importFrom(RcppRoll,roll_minr) importFrom(RcppRoll,roll_prod) importFrom(RcppRoll,roll_prodl) importFrom(RcppRoll,roll_prodr) importFrom(RcppRoll,roll_sd) importFrom(RcppRoll,roll_sdl) importFrom(RcppRoll,roll_sdr) importFrom(RcppRoll,roll_sum) importFrom(RcppRoll,roll_suml) importFrom(RcppRoll,roll_sumr) importFrom(RcppRoll,roll_var) importFrom(RcppRoll,roll_varl) importFrom(RcppRoll,roll_varr) importFrom(ddalpha,depth.Mahalanobis) importFrom(ddalpha,depth.halfspace) importFrom(ddalpha,depth.potential) importFrom(ddalpha,depth.projection) importFrom(ddalpha,depth.simplicial) importFrom(ddalpha,depth.simplicialVolume) importFrom(ddalpha,depth.spatial) importFrom(ddalpha,depth.zonoid) importFrom(dimRed,FastICA) importFrom(dimRed,dimRedData) importFrom(dimRed,embed) importFrom(dimRed,kPCA) importFrom(dplyr,filter) importFrom(dplyr,full_join) importFrom(dplyr,left_join) importFrom(gower,gower_topn) importFrom(ipred,ipredbagg) importFrom(lubridate,decimal_date) importFrom(lubridate,is.Date) importFrom(lubridate,month) importFrom(lubridate,quarter) importFrom(lubridate,semester) importFrom(lubridate,wday) importFrom(lubridate,week) importFrom(lubridate,yday) importFrom(lubridate,year) importFrom(magrittr,"%>%") importFrom(purrr,map) importFrom(purrr,map_chr) importFrom(purrr,map_if) importFrom(purrr,map_lgl) importFrom(rlang,expr) importFrom(rlang,f_lhs) importFrom(rlang,is_empty) importFrom(rlang,names2) importFrom(rlang,quos) importFrom(splines,ns) importFrom(stats,as.formula) importFrom(stats,binomial) importFrom(stats,complete.cases) importFrom(stats,cor) importFrom(stats,cov) importFrom(stats,mahalanobis) importFrom(stats,model.frame) importFrom(stats,model.matrix) importFrom(stats,optimize) importFrom(stats,poly) importFrom(stats,prcomp) importFrom(stats,predict) importFrom(stats,quantile) importFrom(stats,sd) importFrom(stats,terms) importFrom(stats,var) importFrom(tibble,add_column) importFrom(tibble,as_tibble) importFrom(tibble,is_tibble) importFrom(tibble,tibble) recipes/NEWS.md 0000644 0001777 0001777 00000002252 13136242173 014337 0 ustar herbrandt herbrandt # recipes 0.1.0 First CRAN release. * Changed `prepare` to `prep` per [issue #59](https://github.com/topepo/recipes/issues/59) # recipes 0.0.1.9003 * Two of the main functions [changed names](https://github.com/topepo/recipes/issues/57). `learn` has become `prepare` and `process` has become `bake` # recipes 0.0.1.9002 New steps: * `step_lincomb` removes variables involved in linear combinations to resolve them. * A step for converting binary variables to factors (`step_bin2factor`) * `step_regex` applies a regular expression to a character or factor vector to create dummy variables. Other changes: * `step_dummy` and `step_interact` do a better job of respecting missing values in the data set. # recipes 0.0.1.9001 * The class system for `recipe` objects was changed so that [pipes can be used to create the recipe with a formula](https://github.com/topepo/recipes/issues/46). * `process.recipe` lost the `role` argument in factor of a general set of [selectors](https://topepo.github.io/recipes/articles/Selecting_Variables.html). If no selector is used, all the predictors are returned. * Two steps for simple imputation using the mean or mode were added. recipes/data/ 0000755 0001777 0001777 00000000000 13136242227 014151 5 ustar herbrandt herbrandt recipes/data/covers.RData 0000644 0001777 0001777 00000001320 13104230631 016352 0 ustar herbrandt herbrandt Vn@5MAj%2