AnnotationFilter/DESCRIPTION0000644000175400017540000000305113175747732016605 0ustar00biocbuildbiocbuildPackage: AnnotationFilter Title: Facilities for Filtering Bioconductor Annotation Resources Version: 1.2.0 Authors@R: c( person("Martin", "Morgan", email = "martin.morgan@roswellpark.org", role = "aut"), person("Johannes", "Rainer", email = "johannes.rainer@eurac.edu", role = "aut"), person("Joachim", "Bargsten", email = "jw@bargsten.org", role = "ctb"), person("Daniel", "Van Twisk", email = "daniel.vantwisk@roswellpark.org", role = "ctb"), person("Bioconductor", "Maintainer", email="maintainer@bioconductor.org", role = "cre")) URL: https://github.com/Bioconductor/AnnotationFilter BugReports: https://github.com/Bioconductor/AnnotationFilter/issues Description: This package provides class and other infrastructure to implement filters for manipulating Bioconductor annotation resources. The filters will be used by ensembldb, Organism.dplyr, and other packages. Depends: R (>= 3.4.0) Imports: utils, methods, GenomicRanges, lazyeval Suggests: BiocStyle, knitr, testthat, RSQLite, org.Hs.eg.db VignetteBuilder: knitr License: Artistic-2.0 biocViews: Annotation, Infrastructure, Software Encoding: UTF-8 LazyData: true RoxygenNote: 6.0.1 Collate: 'AllGenerics.R' 'AnnotationFilter.R' 'AnnotationFilterList.R' 'translate-utils.R' NeedsCompilation: no Packaged: 2017-10-31 01:20:26 UTC; biocbuild Author: Martin Morgan [aut], Johannes Rainer [aut], Joachim Bargsten [ctb], Daniel Van Twisk [ctb], Bioconductor Maintainer [cre] Maintainer: Bioconductor Maintainer AnnotationFilter/NAMESPACE0000644000175400017540000000376613175715601016321 0ustar00biocbuildbiocbuild# Generated by roxygen2: do not edit by hand export(AnnotationFilter) export(AnnotationFilterList) export(CdsEndFilter) export(CdsStartFilter) export(EntrezFilter) export(ExonEndFilter) export(ExonIdFilter) export(ExonNameFilter) export(ExonRankFilter) export(ExonStartFilter) export(GRangesFilter) export(GeneBiotypeFilter) 
export(GeneEndFilter) export(GeneIdFilter) export(GeneStartFilter) export(GenenameFilter) export(ProteinIdFilter) export(SeqNameFilter) export(SeqStrandFilter) export(SymbolFilter) export(TxBiotypeFilter) export(TxEndFilter) export(TxIdFilter) export(TxNameFilter) export(TxStartFilter) export(UniprotFilter) export(feature) export(logicOp) export(not) exportClasses(AnnotationFilter) exportClasses(AnnotationFilterList) exportClasses(CdsEndFilter) exportClasses(CdsStartFilter) exportClasses(CharacterFilter) exportClasses(EntrezFilter) exportClasses(ExonEndFilter) exportClasses(ExonIdFilter) exportClasses(ExonNameFilter) exportClasses(ExonRankFilter) exportClasses(ExonStartFilter) exportClasses(GRangesFilter) exportClasses(GeneBiotypeFilter) exportClasses(GeneEndFilter) exportClasses(GeneIdFilter) exportClasses(GeneStartFilter) exportClasses(GenenameFilter) exportClasses(IntegerFilter) exportClasses(ProteinIdFilter) exportClasses(SeqNameFilter) exportClasses(SeqStrandFilter) exportClasses(SymbolFilter) exportClasses(TxBiotypeFilter) exportClasses(TxEndFilter) exportClasses(TxIdFilter) exportClasses(TxNameFilter) exportClasses(TxStartFilter) exportClasses(UniprotFilter) exportMethods(condition) exportMethods(convertFilter) exportMethods(distributeNegation) exportMethods(field) exportMethods(not) exportMethods(show) exportMethods(supportedFilters) exportMethods(value) importClassesFrom(GenomicRanges,GRanges) importFrom(GenomicRanges,GRanges) importFrom(GenomicRanges,show) importFrom(lazyeval,f_eval) importFrom(methods,callNextMethod) importFrom(methods,initialize) importFrom(methods,is) importFrom(methods,new) importFrom(methods,show) importFrom(methods,validObject) importFrom(utils,tail) AnnotationFilter/NEWS0000644000175400017540000000043513175715601015567 0ustar00biocbuildbiocbuildCHANGES IN VERSION 1.1.2 ------------------------ NEW FEATURES o supportedFilters returns a data.frame with filter class name and field. 
CHANGES IN VERSION 0.99.5 -------------------------- NEW FEATURES o Add convertFilterExpressionQuoted function. o Add field method. AnnotationFilter/NOTES.md0000644000175400017540000000042513175715601016301 0ustar00biocbuildbiocbuild# Development guidelines - roxygen2 documentation - testthat unit tests - file name correspondence between code `R/foo.R`, tests `tests/testthat/test_foo.R`, and documentation `man/foo.Rd`. - version bump on master commit - commits to master pass R CMD build && R CMD check AnnotationFilter/R/0000755000175400017540000000000013175715601015267 5ustar00biocbuildbiocbuildAnnotationFilter/R/AllGenerics.R0000644000175400017540000000133613175715601017605 0ustar00biocbuildbiocbuild## Generic methods. setGeneric("condition", function(object, ...) standardGeneric("condition")) setGeneric("field", function(object, ...) standardGeneric("field")) setGeneric("value", function(object, ...) standardGeneric("value")) setGeneric("logicOp", function(object, ...) standardGeneric("logicOp")) setGeneric("not", function(object, ...) standardGeneric("not")) setGeneric("simplify", function(object, ...) standardGeneric("simplify")) setGeneric("convertFilter", function(object, db, ...) standardGeneric("convertFilter")) setGeneric("distributeNegation", function(object, ...) standardGeneric("distributeNegation")) setGeneric("supportedFilters", function(object, ...) 
standardGeneric("supportedFilters")) AnnotationFilter/R/AnnotationFilter.R0000644000175400017540000004012513175715601020674 0ustar00biocbuildbiocbuild#' @name AnnotationFilter #' #' @title Filters for annotation objects #' #' @aliases CdsStartFilter CdsEndFilter ExonIdFilter ExonNameFilter #' ExonStartFilter ExonEndFilter ExonRankFilter GeneIdFilter #' GenenameFilter GeneBiotypeFilter GeneStartFilter GeneEndFilter #' EntrezFilter SymbolFilter TxIdFilter TxNameFilter #' TxBiotypeFilter TxStartFilter TxEndFilter ProteinIdFilter #' UniprotFilter SeqNameFilter SeqStrandFilter #' AnnotationFilter-class CharacterFilter-class #' IntegerFilter-class CdsStartFilter-class CdsEndFilter-class #' ExonIdFilter-class ExonNameFilter-class ExonStartFilter-class #' ExonEndFilter-class ExonRankFilter-class GeneIdFilter-class #' GenenameFilter-class GeneBiotypeFilter-class #' GeneStartFilter-class GeneEndFilter-class EntrezFilter-class #' SymbolFilter-class TxIdFilter-class TxNameFilter-class #' TxBiotypeFilter-class TxStartFilter-class TxEndFilter-class #' ProteinIdFilter-class UniprotFilter-class SeqNameFilter-class #' SeqStrandFilter-class supportedFilters #' show,AnnotationFilter-method show,CharacterFilter-method #' show,IntegerFilter-method show,GRangesFilter-method #' #' @description #' #' The filters extending the base \code{AnnotationFilter} class #' represent a simple filtering concept for annotation resources. #' Each filter object is intended to filter on a single (database) #' table column using the provided values and the defined condition. #' #' Filter instances are created using the constructor functions (e.g. #' \code{GeneIdFilter}). #' #' \code{supportedFilters()} lists all defined filters. It returns a two column #' \code{data.frame} with the filter class name and its default field. #' Packages using \code{AnnotationFilter} should implement the #' \code{supportedFilters} method for their annotation resource object (e.g. 
for #' \code{object = "EnsDb"} in the \code{ensembldb} package) to list all #' supported filters for the specific resource. #' #' @details #' #' By default filters are only available for tables containing the #' field on which the filter acts (i.e. that contain a column with the #' name matching the value of the \code{field} slot of the #' object). See the vignette for a description of how to use filters with #' databases in which the database table column name differs from the #' default \code{field} of the filter. #' #' @usage #' #' CdsStartFilter(value, condition = "==", not = FALSE) #' CdsEndFilter(value, condition = "==", not = FALSE) #' ExonIdFilter(value, condition = "==", not = FALSE) #' ExonNameFilter(value, condition = "==", not = FALSE) #' ExonRankFilter(value, condition = "==", not = FALSE) #' ExonStartFilter(value, condition = "==", not = FALSE) #' ExonEndFilter(value, condition = "==", not = FALSE) #' GeneIdFilter(value, condition = "==", not = FALSE) #' GenenameFilter(value, condition = "==", not = FALSE) #' GeneBiotypeFilter(value, condition = "==", not = FALSE) #' GeneStartFilter(value, condition = "==", not = FALSE) #' GeneEndFilter(value, condition = "==", not = FALSE) #' EntrezFilter(value, condition = "==", not = FALSE) #' SymbolFilter(value, condition = "==", not = FALSE) #' TxIdFilter(value, condition = "==", not = FALSE) #' TxNameFilter(value, condition = "==", not = FALSE) #' TxBiotypeFilter(value, condition = "==", not = FALSE) #' TxStartFilter(value, condition = "==", not = FALSE) #' TxEndFilter(value, condition = "==", not = FALSE) #' ProteinIdFilter(value, condition = "==", not = FALSE) #' UniprotFilter(value, condition = "==", not = FALSE) #' SeqNameFilter(value, condition = "==", not = FALSE) #' SeqStrandFilter(value, condition = "==", not = FALSE) #' #' @param value \code{character()}, \code{integer()}, or #' \code{GRanges()} value for the filter #' #' @param condition \code{character(1)} defining the condition to be #' used in the filter. 
For \code{IntegerFilter}, one of #' \code{"=="}, \code{"!="}, \code{">"}, \code{"<"}, \code{">="} #' or \code{"<="}. For \code{CharacterFilter}, one of \code{"=="}, #' \code{"!="}, \code{"startsWith"}, \code{"endsWith"} or \code{"contains"}. #' Default condition is \code{"=="}. #' #' @param not \code{logical(1)} whether the \code{AnnotationFilter} is negated. #' \code{TRUE} indicates that the filter is negated (!); \code{FALSE} indicates #' that it is not negated. The default is \code{FALSE}. #' #' @return The constructor functions return an object extending #' \code{AnnotationFilter}. For the return value of the other methods see #' the methods' descriptions. #' #' @seealso \code{\link{AnnotationFilterList}} for combining #' \code{AnnotationFilter} objects. NULL .CONDITION <- list( IntegerFilter = c("==", "!=", ">", "<", ">=", "<="), CharacterFilter = c("==", "!=", "startsWith", "endsWith", "contains"), GRangesFilter = c("any", "start", "end", "within", "equal") ) .FIELD <- list( CharacterFilter = c( "exon_id", "exon_name", "gene_id", "genename", "gene_biotype", "entrez", "symbol", "tx_id", "tx_name", "tx_biotype", "protein_id", "uniprot", "seq_name", "seq_strand"), IntegerFilter = c( "cds_start", "cds_end", "exon_start", "exon_rank", "exon_end", "gene_start", "gene_end", "tx_start", "tx_end") ) .valid_condition <- function(condition, class) { txt <- character() test0 <- length(condition) == 1L if (!test0) txt <- c(txt, "'condition' must be length 1") test1 <- test0 && (condition %in% .CONDITION[[class]]) if (!test1) { value <- paste(sQuote(.CONDITION[[class]]), collapse=" ") txt <- c(txt, paste0("'", condition, "' must be in ", value)) } if (length(txt)) txt else TRUE } ############################################################ ## AnnotationFilter ## #' @exportClass AnnotationFilter .AnnotationFilter <- setClass( "AnnotationFilter", contains = "VIRTUAL", slots = c( field="character", condition="character", value="ANY", not="logical" ), prototype=list( condition= "==", not= FALSE ) ) 
setValidity("AnnotationFilter", function(object) { txt <- character() value <- .value(object) condition <- .condition(object) not <- .not(object) test_len <- length(condition) == 1L test_NA <- !any(is.na(condition)) if (test_len && !test_NA) txt <- c(txt, "'condition' can not be NA") test0 <- test_len && test_NA test1 <- condition %in% c("startsWith", "endsWith", "contains", ">", "<", ">=", "<=") if (test0 && test1 && length(value) > 1L) txt <- c(txt, paste0("'", condition, "' requires length 1 'value'")) if(length(not) != 1) txt <- c(txt, '"not" value must be of length 1.') if (any(is.na(value))) txt <- c(txt, "'value' can not be NA") if (length(txt)) txt else TRUE }) .field <- function(object) object@field .condition <- function(object) object@condition .value <- function(object) object@value .not <- function(object) object@not #' @rdname AnnotationFilter #' #' @aliases condition #' #' @description \code{condition()} get the \code{condition} value for #' the filter \code{object}. #' #' @param object An \code{AnnotationFilter} object. #' #' @export setMethod("condition", "AnnotationFilter", .condition) #' @rdname AnnotationFilter #' #' @aliases value #' #' @description \code{value()} get the \code{value} for the filter #' \code{object}. #' #' @export setMethod("value", "AnnotationFilter", .value) #' @rdname AnnotationFilter #' #' @aliases field #' #' @description \code{field()} get the \code{field} for the filter #' \code{object}. #' #' @export setMethod("field", "AnnotationFilter", .field) #' @rdname AnnotationFilter #' #' @description \code{not()} get the \code{not} for the filter \code{object}. 
#' #' @export setMethod("not", "AnnotationFilter", .not) #' @importFrom methods show #' #' @export setMethod("show", "AnnotationFilter", function(object){ if(.not(object)) cat("NOT\n") cat("class:", class(object), "\ncondition:", .condition(object), "\n") }) ############################################################ ## CharacterFilter, IntegerFilter ## #' @exportClass CharacterFilter .CharacterFilter <- setClass( "CharacterFilter", contains = c("VIRTUAL", "AnnotationFilter"), slots = c(value = "character"), prototype = list( value = character() ) ) setValidity("CharacterFilter", function(object) { .valid_condition(.condition(object), "CharacterFilter") }) #' @importFrom methods show callNextMethod #' #' @export setMethod("show", "CharacterFilter", function(object) { callNextMethod() cat("value:", .value(object), "\n") }) #' @exportClass IntegerFilter .IntegerFilter <- setClass( "IntegerFilter", contains = c("VIRTUAL", "AnnotationFilter"), slots = c(value = "integer"), prototype = list( value = integer() ) ) setValidity("IntegerFilter", function(object) { .valid_condition(.condition(object), "IntegerFilter") }) #' @export setMethod("show", "IntegerFilter", function(object) { callNextMethod() cat("value:", .value(object), "\n") }) #' @rdname AnnotationFilter #' #' @importFrom GenomicRanges GRanges #' #' @importClassesFrom GenomicRanges GRanges #' #' @exportClass GRangesFilter .GRangesFilter <- setClass( "GRangesFilter", contains = "AnnotationFilter", slots = c( value = "GRanges", feature = "character" ), prototype = list( value = GRanges(), condition = "any", field = "granges", feature = "gene" ) ) setValidity("GRangesFilter", function(object) { .valid_condition(.condition(object), "GRangesFilter") }) .feature <- function(object) object@feature #' @rdname AnnotationFilter #' #' @param type \code{character(1)} indicating how overlaps are to be #' filtered. See \code{findOverlaps} in the IRanges package for a #' description of this argument. 
#' #' @examples #' ## filter by GRanges #' GRangesFilter(GenomicRanges::GRanges("chr10:87869000-87876000")) #' @export GRangesFilter <- function(value, feature = "gene", type = c("any", "start", "end", "within", "equal")) { condition <- match.arg(type) .GRangesFilter( field = "granges", value = value, condition = condition, feature = feature) } .feature <- function(object) object@feature #' @aliases feature #' #' @description \code{feature()} get the \code{feature} for the #' \code{GRangesFilter} \code{object}. #' #' @rdname AnnotationFilter #' #' @export feature <- .feature #' @importFrom GenomicRanges show #' #' @export setMethod("show", "GRangesFilter", function(object) { callNextMethod() cat("feature:", .feature(object), "\nvalue:\n") show(value(object)) }) ############################################################ ## Create install-time classes ## #' @rdname AnnotationFilter #' #' @name AnnotationFilter #' #' @param feature \code{character(1)} defining on what feature the #' \code{GRangesFilter} should be applied. Choices could be #' \code{"gene"}, \code{"tx"} or \code{"exon"}. #' #' @examples #' ## Create a SymbolFilter to filter on a gene's symbol. 
#' sf <- SymbolFilter("BCL2") #' sf #' #' ## Create a GeneStartFilter to filter based on the genes' chromosomal start #' ## coordinates #' gsf <- GeneStartFilter(10000, condition = ">") #' gsf #' #' @export CdsStartFilter CdsEndFilter ExonIdFilter ExonNameFilter #' @export ExonStartFilter ExonEndFilter ExonRankFilter GeneIdFilter #' @export GenenameFilter GeneBiotypeFilter GeneStartFilter #' @export GeneEndFilter EntrezFilter SymbolFilter TxIdFilter #' @export TxNameFilter TxBiotypeFilter TxStartFilter TxEndFilter #' @export ProteinIdFilter UniprotFilter SeqNameFilter SeqStrandFilter #' #' @importFrom methods new #' #' @exportClass CdsStartFilter CdsEndFilter ExonIdFilter #' ExonNameFilter ExonStartFilter ExonEndFilter ExonRankFilter #' GeneIdFilter GenenameFilter GeneBiotypeFilter GeneStartFilter #' GeneEndFilter EntrezFilter SymbolFilter TxIdFilter TxNameFilter #' TxBiotypeFilter TxStartFilter TxEndFilter ProteinIdFilter #' UniprotFilter SeqNameFilter SeqStrandFilter NULL .fieldToClass <- function(field) { class <- gsub("_([[:alpha:]])", "\\U\\1", field, perl=TRUE) class <- sub("^([[:alpha:]])", "\\U\\1", class, perl=TRUE) paste0(class, if (length(class)) "Filter" else character(0)) } .filterFactory <- function(field, class) { force(field); force(class) # watch for lazy evaluation as.value <- if (field %in% .FIELD[["CharacterFilter"]]) { function(x) { # if(!is.character(x)) # stop("Input to a ", field, # "filter must be a character vector.") as.character(x) } } else { function(x) { if(!is.numeric(x)) stop("Input to a ", field, " filter must be a numeric vector.") as.integer(x) } } function(value, condition = "==", not = FALSE) { value <- as.value(value) condition <- as.character(condition) not <- as.logical(not) new(class, field=field, condition = condition, value=value, not=not) } } local({ makeClass <- function(contains) { fields <- .FIELD[[contains]] classes <- .fieldToClass(fields) for (i in seq_along(fields)) { setClass(classes[[i]], contains=contains, 
where=topenv()) assign( classes[[i]], .filterFactory(fields[[i]], classes[[i]]), envir=topenv() ) } } for (contains in names(.FIELD)) makeClass(contains) }) ############################################################ ## Utilities ## .convertFilter <- function(object) { field <- field(object) if (field == "granges") stop("GRangesFilter cannot be converted using convertFilter().") value <- value(object) condition <- condition(object) not <- not(object) op <- switch( condition, "==" = if (length(value) == 1) "==" else "%in%", "!=" = if (length(value) == 1) "!=" else "%in%", "startsWith" = "%like%", "endsWith" = "%like%", "contains" = "%like%" ) not_val <- ifelse(not, '!', '') if (condition %in% c("==", "!=")) value <- paste0("'", value, "'", collapse=", ") if (!is.null(op) && op %in% c("==", "!=")) sprintf("%s%s %s %s", not_val, field, op, value) else if ((condition == "==") && op == "%in%") sprintf("%s%s %s c(%s)", not_val, field, op, value) else if ((condition == "!=") && op == "%in%") if(not) sprintf("%s %s c(%s)", field, op, value) else sprintf("!%s%s %s c(%s)", not_val, field, op, value) else if (condition == "startsWith") sprintf("%s%s %s '%s%%'", not_val, field, op, value) else if (condition == "endsWith") sprintf("%s%s %s '%%%s'", not_val, field, op, value) else if (condition == "contains") sprintf("%s%s %s '%s'", not_val, field, op, value) else if (condition %in% c(">", "<", ">=", "<=")) { sprintf("%s%s %s %s", not_val, field, condition, as.integer(value)) } } #' @rdname AnnotationFilter #' #' @description Converts an \code{AnnotationFilter} object to a #' \code{character(1)} giving an equation that can be used as input to #' a \code{dplyr} filter. #' #' @return \code{character(1)} that can be used as input to a \code{dplyr} #' filter. 
#' #' @examples #' filter <- SymbolFilter("ADA", "==") #' result <- convertFilter(filter) #' result #' @export setMethod("convertFilter", signature(object = "AnnotationFilter", db = "missing"), .convertFilter) .FILTERS_WO_FIELD <- c("GRangesFilter") .supportedFilters <- function() { fields <- unlist(.FIELD, use.names=FALSE) filters <- .fieldToClass(fields) d <- data.frame( filter=c(filters, .FILTERS_WO_FIELD), field=c(fields, "granges") #rep(NA, length(.FILTERS_WO_FIELD))) ) d[order(d$filter),] } #' @rdname AnnotationFilter #' #' @examples #' supportedFilters() #' @export setMethod("supportedFilters", "missing", function(object) { .supportedFilters() }) AnnotationFilter/R/AnnotationFilterList.R0000644000175400017540000002454613175715601021541 0ustar00biocbuildbiocbuild#' @include AnnotationFilter.R #' @rdname AnnotationFilterList #' #' @name AnnotationFilterList #' #' @title Combining annotation filters #' #' @aliases AnnotationFilterList-class #' #' @description The \code{AnnotationFilterList} allows combining #' filter objects extending the \code{\link{AnnotationFilter}} #' class to construct more complex queries. Consecutive filter #' objects in the \code{AnnotationFilterList} can be combined by a #' logical \emph{and} (\code{&}) or \emph{or} (\code{|}). The #' \code{AnnotationFilterList} extends \code{list}; individual #' elements can thus be accessed with \code{[[}. #' #' @note The \code{AnnotationFilterList} does not support empty #' elements, hence all elements of \code{length == 0} are removed in #' the constructor function. 
#' #' @exportClass AnnotationFilterList NULL .AnnotationFilterList <- setClass( "AnnotationFilterList", contains = "list", slots = c(logOp = "character", not = "logical", .groupingFlag = "logical") ) .LOG_OPS <- c("&", "|") setValidity("AnnotationFilterList", function(object) { txt <- character() filters <- .aflvalue(object) logOp <- .logOp(object) not <- .not(object) if (length(filters) == 0 && length(logOp)) { txt <- c( txt, "'logicOp' can not have length > 0 if the object is empty" ) } else if (length(filters) != 0) { ## Note: we allow length of filters being 1, but then logOp has ## to be empty. Check content: fun <- function(z) is(z, "AnnotationFilter") || is(z, "AnnotationFilterList") test <- vapply(filters, fun, logical(1)) if (!all(test)){ txt <- c( txt, "only 'AnnotationFilter' or 'AnnotationFilterList' allowed" ) } # Check that all elements are non-empty (issue #17). Doing this ## separately from the check above to ensure we get a different error ## message. if (!all(lengths(filters) > 0)) txt <- c(txt, "Lengths of all elements have to be > 0") ## Check that logOp has length object -1 if (length(logOp) != length(filters) - 1) txt <- c(txt, "length of 'logicOp' has to be length of the object -1") ## Check content of logOp. if (!all(logOp %in% .LOG_OPS)) txt <- c(txt, "'logicOp' can only contain '&' and '|'") } if (length(txt)) txt else TRUE }) ## AnnotationFilterList constructor function. #' @rdname AnnotationFilterList #' #' @name AnnotationFilterList #' #' @param ... individual \code{\link{AnnotationFilter}} objects or a #' mixture of \code{AnnotationFilter} and #' \code{AnnotationFilterList} objects. #' #' @param logicOp \code{character} of length equal to the number #' of submitted \code{AnnotationFilter} objects - 1. Each value #' representing the logical operation to combine consecutive #' filters, i.e. 
the first element being the logical operation to #' combine the first and second \code{AnnotationFilter}, the #' second element being the logical operation to combine the #' second and third \code{AnnotationFilter} and so on. Allowed #' values are \code{"&"} and \code{"|"}. The function assumes a #' logical \emph{and} between all elements by default. #' #' @param logOp Deprecated; use \code{logicOp=}. #' #' @param .groupingFlag Flag designated for internal use only. #' #' @param not \code{logical} of length one. Indicates whether the grouping #' of \code{AnnotationFilters} is to be negated. #' #' @seealso \code{\link{supportedFilters}} for available #' \code{\link{AnnotationFilter}} objects #' #' @return \code{AnnotationFilterList} returns an \code{AnnotationFilterList}. #' #' @examples #' ## Create some AnnotationFilters #' gf <- GenenameFilter(c("BCL2", "BCL2L11")) #' tbtf <- TxBiotypeFilter("protein_coding", condition = "!=") #' #' ## Combine both to an AnnotationFilterList. By default elements are combined #' ## using a logical "and" operator. The filter list represents thus a query #' ## like: get all features where the gene name is either ("BCL2" or "BCL2L11") #' ## and the transcript biotype is not "protein_coding". #' afl <- AnnotationFilterList(gf, tbtf) #' afl #' #' ## Access individual filters. #' afl[[1]] #' #' ## Create a filter in the form of: get all features where the gene name is #' ## either ("BCL2" or "BCL2L11") and the transcript biotype is not #' ## "protein_coding" or the seq_name is "Y". Hence, this will get all features #' ## also found by the previous AnnotationFilterList and additionally return all #' ## features on chromosome Y. 
#' afl <- AnnotationFilterList(gf, tbtf, SeqNameFilter("Y"), #' logicOp = c("&", "|")) #' afl #' #' @export AnnotationFilterList <- function(..., logicOp = character(), logOp = character(), not = FALSE, .groupingFlag=FALSE) { if (!missing(logOp) && missing(logicOp)) { logicOp <- logOp .Deprecated(msg = "'logOp' deprecated, use 'logicOp'") } filters <- list(...) ## Remove empty nested lists and AnnotationFilterLists removal <- lengths(filters) != 0 filters <- filters[removal] if (length(filters) > 1 && length(logicOp) == 0) ## By default we're assuming & between elements. logicOp <- rep("&", (length(filters) - 1)) .AnnotationFilterList(filters, logOp = logicOp, not = not, .groupingFlag=.groupingFlag) } .logOp <- function(object) object@logOp .aflvalue <- function(object) object@.Data .not <- function(object) object@not #' @rdname AnnotationFilterList #' #' @description \code{value()} get a \code{list} with the #' \code{AnnotationFilter} objects. Use \code{[[} to access #' individual filters. #' #' @return \code{value()} returns a \code{list} with \code{AnnotationFilter} #' objects. #' #' @export setMethod("value", "AnnotationFilterList", .aflvalue) #' @rdname AnnotationFilterList #' #' @aliases logicOp #' #' @description \code{logicOp()} gets the logical operators separating #' successive \code{AnnotationFilter}. #' #' @return \code{logicOp()} returns a \code{character()} vector of #' \dQuote{&} or \dQuote{|} symbols. #' #' @export logicOp setMethod("logicOp", "AnnotationFilterList", .logOp) #' @rdname AnnotationFilterList #' #' @aliases not #' #' @description \code{not()} gets the flag indicating whether the #' \code{AnnotationFilterList} is negated. #' #' @return \code{not()} returns a \code{logical} of length one, \code{TRUE} #' if the \code{AnnotationFilterList} is negated. 
#' #' @export not setMethod("not", "AnnotationFilterList", .not) .distributeNegation <- function(object, .prior_negation=FALSE) { if(.not(object)) .prior_negation <- ifelse(.prior_negation, FALSE, TRUE) filters <- lapply(object, function(x){ if(is(x, "AnnotationFilterList")) distributeNegation(x, .prior_negation) else{ if(.prior_negation) x@not <- ifelse(x@not, FALSE, TRUE) x } }) ops <- vapply(logicOp(object), function(x) { if(.prior_negation){ if(x == '&') '|' else '&' } else x } ,character(1) ) ops <- unname(ops) filters[['logicOp']] <- ops do.call("AnnotationFilterList", filters) } #' @rdname AnnotationFilterList #' #' @aliases distributeNegation #' #' @description #' #' @param .prior_negation \code{logical(1)} unused argument. #' #' @return \code{AnnotationFilterList} object with DeMorgan's law applied to #' it such that it is equal to the original \code{AnnotationFilterList} #' object but all \code{!}'s are distributed out of the #' \code{AnnotationFilterList} object and to the nested #' \code{AnnotationFilter} objects. 
#' #' @examples #' afl <- AnnotationFilter(~!(symbol == 'ADA' | symbol %startsWith% 'SNORD')) #' afl <- distributeNegation(afl) #' afl #' @export setMethod("distributeNegation", "AnnotationFilterList", .distributeNegation) .convertFilterList <- function(object, show, granges=list(), nested=FALSE) { filters <- value(object) result <- character(length(filters)) for (i in seq_len(length(filters))) { if (is(filters[[i]], "AnnotationFilterList")) { res <- .convertFilterList(filters[[i]], show=show, granges=granges, nested=TRUE) granges <- c(granges, res[[2]]) result[i] <- res[[1]] } else if (field(filters[[i]]) == "granges") { if(!show) result[i] <- .convertFilter(filters[[i]]) else { nam <- paste0("GRangesFilter_", length(granges) + 1) granges <- c(granges, list(filters[[i]])) result[i] <- nam } } else result[i] <- .convertFilter(filters[[i]]) } result_last <- tail(result, 1) result <- head(result, -1) result <- c(rbind(result, logicOp(object))) result <- c(result, result_last) result <- paste(result, collapse=" ") if(nested || object@not) result <- paste0("(", result, ")") if(object@not) result <- paste0("!", result) list(result, granges) } #' @rdname AnnotationFilterList #' #' @aliases convertFilter #' #' @description Converts an \code{AnnotationFilterList} object to a #' \code{character(1)} giving an equation that can be used as input to #' a \code{dplyr} filter. #' #' @return \code{character(1)} that can be used as input to a \code{dplyr} #' filter. #' #' @examples #' afl <- AnnotationFilter(~symbol=="ADA" & tx_start > "400000") #' result <- convertFilter(afl) #' result #' @export setMethod("convertFilter", signature(object = "AnnotationFilterList", db = "missing") , function(object) { result <- .convertFilterList(object, show=FALSE) result[[1]] }) #' @rdname AnnotationFilterList #' #' @param object An object of class \code{AnnotationFilterList}. 
#' #' @importFrom utils tail #' @export setMethod("show", "AnnotationFilterList", function(object) { result <- .convertFilterList(object, show=TRUE) granges <- result[[2]] result <- result[[1]] cat("AnnotationFilterList of length", length(object), "\n") cat(result) cat("\n") for(i in seq_len(length(granges))) { cat("\n") cat("Symbol: GRangesFilter_", i, "\n", sep="") show(granges[[i]]) cat("\n") } }) AnnotationFilter/R/translate-utils.R0000644000175400017540000001330613175715601020550 0ustar00biocbuildbiocbuild#' @include AnnotationFilter.R ## Functionality to translate a query condition to an AnnotationFilter. #' Adapted from GenomicDataCommons. #' #' @importFrom methods is validObject initialize #' #' @noRd .binary_op <- function(sep) { force(sep) function(e1, e2) { ## First create the class. Throws an error if not possible i.e. no ## class for the field available. field <- as.character(substitute(e1)) class <- .fieldToClass(field) filter <- tryCatch({ new(class, condition = sep, field = field) }, error = function(e) { stop("No AnnotationFilter class '", class, "' for field '", field, "' defined") }) ## Fill with values. force(e2) if (is(filter, "CharacterFilter")) { e2 <- as.character(e2) } else if (is(filter, "IntegerFilter")) { e2 <- as.integer(e2) } initialize(filter, value = e2) } } #' Functionality to translate a unary operation into an AnnotationFilter. #' #' @noRd .not_op <- function(sep) { force(sep) function(x) { if(is(x, "AnnotationFilterList") || is(x, "AnnotationFilter")) { if(x@not) x@not <- FALSE else x@not <- TRUE if(is(x, "AnnotationFilterList")) x@.groupingFlag <- FALSE return(x) } # else if (is(x, "AnnotationFilter")) # AnnotationFilterList(x, logicOp=character(), not=TRUE) else stop('Arguments to "!" 
must be an AnnotationFilter or AnnotationFilterList.') } } .parenthesis_op <- function(sep) { force(sep) function(x) { if (is(x, "AnnotationFilterList")) { x@.groupingFlag <- FALSE x } else AnnotationFilterList(x, .groupingFlag=FALSE) } } #' Combine filters into an AnnotationFilterList combined with \code{sep} #' #' @noRd .combine_op <- function(sep) { force(sep) function(e1, e2) { op1 <- character() op2 <- character() if (is(e1, "AnnotationFilterList") && e1@.groupingFlag) { op1 <- logicOp(e1) e1 <- .aflvalue(e1) } else { e1 <- list(e1) } if (is(e2, "AnnotationFilterList") && e2@.groupingFlag) { op2 <- logicOp(e2) e2 <- .aflvalue(e2) } else { e2 <- list(e2) } input <- c(e1, e2) input[['logicOp']] <- c(op1, sep, op2) input[['.groupingFlag']] <- TRUE do.call("AnnotationFilterList", input) } } #' The \code{.LOG_OP_REG} is a \code{list} providing functions for #' common logical operations to translate expressions into AnnotationFilter #' objects. #' #' @noRd .LOG_OP_REG <- list() ## Assign conditions. .LOG_OP_REG$`==` <- .binary_op("==") .LOG_OP_REG$`%in%` <- .binary_op("==") .LOG_OP_REG$`!=` <- .binary_op("!=") .LOG_OP_REG$`>` <- .binary_op(">") .LOG_OP_REG$`<` <- .binary_op("<") .LOG_OP_REG$`>=` <- .binary_op(">=") .LOG_OP_REG$`<=` <- .binary_op("<=") ## Custom binary operators .LOG_OP_REG$`%startsWith%` <- .binary_op("startsWith") .LOG_OP_REG$`%endsWith%` <- .binary_op("endsWith") .LOG_OP_REG$`%contains%` <- .binary_op("contains") ## not conditional. 
.LOG_OP_REG$`!` <- .not_op("!")
## parenthesis
.LOG_OP_REG$`(` <- .parenthesis_op("(")
## combine filters
.LOG_OP_REG$`&` <- .combine_op("&")
.LOG_OP_REG$`|` <- .combine_op("|")

`%startsWith%` <- function(e1, e2){}
`%endsWith%` <- function(e1, e2){}
`%contains%` <- function(e1, e2){}

#' @rdname AnnotationFilter
#'
#' @description \code{AnnotationFilter} \emph{translates} a filter
#'     expression such as \code{~ gene_id == "BCL2"} into a filter object
#'     extending the \code{\link{AnnotationFilter}} class (in the example a
#'     \code{\link{GeneIdFilter}} object) or an
#'     \code{\link{AnnotationFilterList}} if the expression contains multiple
#'     conditions (see examples below). Filter expressions have to be written
#'     in the form \code{~ <field> <condition> <value>}, with \code{<field>}
#'     being the default field of the filter class (use the
#'     \code{supportedFilters} function to list all fields and filter classes),
#'     \code{<condition>} the logical expression and \code{<value>} the value
#'     for the filter.
#'
#' @details Filter expressions for the \code{AnnotationFilter} class have to be
#'     written as formulas, i.e. starting with a \code{~}.
#'
#' @note Translation of nested filter expressions using the
#'     \code{AnnotationFilter} function is not yet supported.
#'
#' @param expr A filter expression, written as a \code{formula}, to be
#'     converted to an \code{AnnotationFilter} or \code{AnnotationFilterList}
#'     class. See below for examples.
#'
#' @return \code{AnnotationFilter} returns an
#'     \code{\link{AnnotationFilter}} or an \code{\link{AnnotationFilterList}}.
#'
#' @importFrom lazyeval f_eval
#'
#' @examples
#'
#' ## Convert a filter expression based on a gene ID to a GeneIdFilter
#' gnf <- AnnotationFilter(~ gene_id == "BCL2")
#' gnf
#'
#' ## Same conversion but for two gene IDs.
#' gnf <- AnnotationFilter(~ gene_id %in% c("BCL2", "BCL2L11"))
#' gnf
#'
#' ## Converting an expression that combines multiple filters. As a result we
#' ## get an AnnotationFilterList containing the corresponding filters.
#' ## Be aware that nesting of expressions/filters does not work. #' flt <- AnnotationFilter(~ gene_id %in% c("BCL2", "BCL2L11") & #' tx_biotype == "nonsense_mediated_decay" | #' seq_name == "Y") #' flt #' #' @export AnnotationFilter <- function(expr) { res <- f_eval(expr, data = .LOG_OP_REG) if(is(res, "AnnotationFilterList")) res@.groupingFlag <- FALSE res } AnnotationFilter/README0000644000175400017540000000217413175715601015752 0ustar00biocbuildbiocbuildPackage: AnnotationFilter Title: Facilities for Filtering Bioconductor Annotation Resources Version: 0.99.8 Authors@R: c( person("Martin", "Morgan", email = "martin.morgan@roswellpark.org", role = "aut"), person("Johannes", "Rainer", email = "johannes.rainer@eurac.edu", role = "aut"), person("Bioconductor", "Maintainer", email="maintainer@bioconductor.org", role = "cre")) URL: https://github.com/Bioconductor/AnnotationFilter BugReports: https://github.com/Bioconductor/AnnotationFilter/issues Description: This package provides class and other infrastructure to implement filters for manipulating Bioconductor annotation resources. The filters will be used by ensembldb, Organism.dplyr, and other packages. 
Depends: R (>= 3.4.0) Imports: utils, methods, GenomicRanges, lazyeval Suggests: BiocStyle, knitr, testthat, RSQLite, org.Hs.eg.db VignetteBuilder: knitr License: Artistic-2.0 biocViews: Annotation, Infrastructure, Software Encoding: UTF-8 LazyData: true RoxygenNote: 6.0.1 Collate: 'AllGenerics.R' 'AnnotationFilter.R' 'AnnotationFilterList.R' 'translate-utils.R' AnnotationFilter/build/0000755000175400017540000000000013175747732016177 5ustar00biocbuildbiocbuildAnnotationFilter/build/vignette.rds0000644000175400017540000000042213175747732020534 0ustar00biocbuildbiocbuildeQN0t} Qz rjmvUƗS6iU $SffKܦ7 I݃3/`i݊s_1.\5R74hNgv:FӾ|"XX#zdW6Qh Ӵ07o/!)ڑA$=?9F3 4`r1șKd*Ik_Xp;⊍r\lt!gtkQ|j+AnnotationFilter/inst/0000755000175400017540000000000013175747732016055 5ustar00biocbuildbiocbuildAnnotationFilter/inst/doc/0000755000175400017540000000000013175747732016622 5ustar00biocbuildbiocbuildAnnotationFilter/inst/doc/AnnotationFilter.R0000644000175400017540000001123613175747731022227 0ustar00biocbuildbiocbuild## ----style, echo = FALSE, results = 'asis', message=FALSE------------------ BiocStyle::markdown() ## ----supportedFilters------------------------------------------------------ library(AnnotationFilter) supportedFilters() ## ----symbol-filter--------------------------------------------------------- library(AnnotationFilter) smbl <- SymbolFilter("BCL2") smbl ## ----symbol-startsWith----------------------------------------------------- smbl <- SymbolFilter("BCL2", condition = "startsWith") smbl ## ----convert-expression---------------------------------------------------- smbl <- AnnotationFilter(~ symbol == "BCL2") smbl ## ----convert-multi-expression---------------------------------------------- flt <- AnnotationFilter(~ symbol == "BCL2" & tx_biotype == "protein_coding") flt ## ----nested-query---------------------------------------------------------- ## Define the filter query for the first pair of filters. 
afl1 <- AnnotationFilterList(SymbolFilter("BCL2L11"), TxBiotypeFilter("nonsense_mediated_decay")) ## Define the second filter pair in ( brackets should be combined. afl2 <- AnnotationFilterList(SymbolFilter("BCL2"), TxBiotypeFilter("protein_coding")) ## Now combine both with a logical OR afl <- AnnotationFilterList(afl1, afl2, logicOp = "|") afl ## ----define-data.frame----------------------------------------------------- ## Define a simple gene table gene <- data.frame(gene_id = 1:10, symbol = c(letters[1:9], "b"), seq_name = paste0("chr", c(1, 4, 4, 8, 1, 2, 5, 3, "X", 4)), stringsAsFactors = FALSE) gene ## ----simple-symbol--------------------------------------------------------- smbl <- SymbolFilter("b") ## ----simple-symbol-condition----------------------------------------------- condition(smbl) ## ----simple-symbol-value--------------------------------------------------- value(smbl) ## ----simple-symbol-field--------------------------------------------------- field(smbl) ## ----doMatch--------------------------------------------------------------- doMatch <- function(x, filter) { do.call(condition(filter), list(x[, field(filter)], value(filter))) } ## Apply this function doMatch(gene, smbl) ## ----doExtract------------------------------------------------------------- doExtract <- function(x, filter) { x[doMatch(x, filter), ] } ## Apply it on the data doExtract(gene, smbl) ## ----doMatch-formula------------------------------------------------------- doMatch <- function(x, filter) { if (is(filter, "formula")) filter <- AnnotationFilter(filter) do.call(condition(filter), list(x[, field(filter)], value(filter))) } doExtract(gene, ~ gene_id == '2') ## ----orgDb, message = FALSE------------------------------------------------ ## Load the required packages library(org.Hs.eg.db) library(RSQLite) ## Get the database connection dbcon <- org.Hs.eg_dbconn() ## What tables do we have? 
dbListTables(dbcon) ## ----gene_info------------------------------------------------------------- ## What fields are there in the gene_info table? dbListFields(dbcon, "gene_info") ## ----doExtractSQL---------------------------------------------------------- doExtractGene <- function(x, filter) { gene <- dbGetQuery(x, "select * from gene_info") doExtract(gene, filter) } ## Extract all entries for BCL2 bcl2 <- doExtractGene(dbcon, SymbolFilter("BCL2")) bcl2 ## ----simpleSQL------------------------------------------------------------- ## Define a simple function that covers some condition conversion conditionForSQL <- function(x) { switch(x, "==" = "=", x) } ## Define a function to translate a filter into an SQL where condition. ## Character values have to be quoted. where <- function(x) { if (is(x, "CharacterFilter")) value <- paste0("'", value(x), "'") else value <- value(x) paste0(field(x), conditionForSQL(condition(x)), value) } ## Now "translate" a filter using this function where(SeqNameFilter("Y")) ## ----doExtractGene2-------------------------------------------------------- ## Define a function that doExtractGene2 <- function(x, filter) { if (is(filter, "formula")) filter <- AnnotationFilter(filter) query <- paste0("select * from gene_info where ", where(filter)) dbGetQuery(x, query) } bcl2 <- doExtractGene2(dbcon, ~ symbol == "BCL2") bcl2 ## ----performance----------------------------------------------------------- system.time(doExtractGene(dbcon, ~ symbol == "BCL2")) system.time(doExtractGene2(dbcon, ~ symbol == "BCL2")) ## ----si-------------------------------------------------------------------- sessionInfo() AnnotationFilter/inst/doc/AnnotationFilter.Rmd0000644000175400017540000003560413175715601022544 0ustar00biocbuildbiocbuild--- title: "Facilities for Filtering Bioconductor Annotation Resources" output: BiocStyle::html_document2: toc_float: true vignette: > %\VignetteIndexEntry{Facilities for Filtering Bioconductor Annotation resources} 
%\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignettePackage{AnnotationFilter} %\VignetteDepends{org.Hs.eg.db,BiocStyle,RSQLite} --- ```{r style, echo = FALSE, results = 'asis', message=FALSE} BiocStyle::markdown() ``` **Package**: `r Biocpkg("AnnotationFilter")`
**Authors**: `r packageDescription("AnnotationFilter")[["Author"]] `
**Last modified:** `r file.info("AnnotationFilter.Rmd")$mtime`
**Compiled**: `r date()` # Introduction A large variety of annotation resources are available in Bioconductor. Accessing the full content of these databases or even of single tables is computationally expensive and in many instances not required, as users may want to extract only sub-sets of the data e.g. genomic coordinates of a single gene. In that respect, filtering annotation resources before data extraction has a major impact on performance and increases the usability of such genome-scale databases. The `r Biocpkg("AnnotationFilter")` package was thus developed to provide basic filter classes to enable a common filtering framework for Bioconductor annotation resources. `r Biocpkg("AnnotationFilter")` defines filter classes for some of the most commonly used features in annotation databases, such as *symbol* or *genename*. Each filter class is supposed to work on a single database table column and to facilitate filtering on the provided values. Such filter classes enable the user to build complex queries to retrieve specific annotations without needing to know column or table names or the layout of the underlying databases. While initially being developed to be used in the `r Biocpkg("Organism.dplyr")` and `r Biocpkg("ensembldb")` packages, the filter classes and the related filtering concept can be easily added to other annotation packages too. # Filter classes All filter classes extend the basic `AnnotationFilter` class and take one or more *values* and a *condition* to allow filtering on a single database table column. Based on the type of the input value, filter classes are divided into: - `CharacterFilter`: takes a `character` value of length >= 1 and supports conditions `==`, `!=`, `startsWith` and `endsWith`. An example would be a `GeneIdFilter` that allows to filter on gene IDs. - `IntegerFilter`: takes a single `integer` as input and supports the conditions `==`, `!=`, `>`, `<`, `>=` and `<=`. 
An example would be a `GeneStartFilter` that filters results on the (chromosomal) start coordinates of genes. - `GRangesFilter`: is a special filter, as it takes a `GRanges` as `value` and performs the filtering on a combination of columns (i.e. start and end coordinate as well as sequence name and strand). To be consistent with the `findOverlaps` method from the `r Biocpkg("IRanges")` package, the constructor of the `GRangesFilter` filter takes a `type` argument to define its condition. Supported values are `"any"` (the default) that retrieves all entries overlapping the `GRanges`, `"start"` and `"end"` matching all features with the same start and end coordinate respectively, `"within"` that matches all features that are *within* the range defined by the `GRanges` and `"equal"` that returns features that are equal to the `GRanges`. The names of the filter classes are intuitive, the first part corresponding to the database column name with each character following a `_` being capitalized, followed by the key word `Filter`. The name of a filter for a database table column `gene_id` is thus called `GeneIdFilter`. The default database column for a filter is stored in its `field` slot (accessible *via* the `field` method). The `supportedFilters` method can be used to get an overview of all available filter objects defined in `AnnotationFilter`. ```{r supportedFilters} library(AnnotationFilter) supportedFilters() ``` Note that the `AnnotationFilter` package does provides only the filter classes but not the functionality to apply the filtering. Such functionality is annotation resource and database layout dependent and needs thus to be implemented in the packages providing access to annotation resources. # Usage Filters are created *via* their dedicated constructor functions, such as the `GeneIdFilter` function for the `GeneIdFilter` class. 
Because of this simple and cheap creation, filter classes are thought to be *read-only* and thus don't provide *setter* methods to change their slot values. In addition to the constructor functions, `AnnotationFilter` provides the functionality to *translate* query expressions into filter classes (see further below for an example). Below we create a `SymbolFilter` that could be used to filter an annotation resource to retrieve all entries associated with the specified symbol value(s). ```{r symbol-filter} library(AnnotationFilter) smbl <- SymbolFilter("BCL2") smbl ``` Such a filter is supposed to be used to retrieve all entries associated to features with a value in a database table column called *symbol* matching the filter's value `"BCL2"`. Using the `"startsWith"` condition we could define a filter to retrieve all entries for genes with a gene name/symbol starting with the specified value (e.g. `"BCL2"` and `"BCL2L11"` for the example below. ```{r symbol-startsWith} smbl <- SymbolFilter("BCL2", condition = "startsWith") smbl ``` In addition to the constructor functions, `AnnotationFilter` provides a functionality to create filter instances in a more natural and intuitive way by *translating* filter expressions (written as a *formula*, i.e. starting with a `~`). ```{r convert-expression} smbl <- AnnotationFilter(~ symbol == "BCL2") smbl ``` Individual `AnnotationFilter` objects can be combined in an `AnnotationFilterList`. This class extends `list` and provides an additional `logicOp()` that defines how its individual filters are supposed to be combined. The length of `logicOp()` has to be 1 less than the number of filter objects. Each element in `logicOp()` defines how two consecutive filters should be combined. Below we create a `AnnotationFilterList` containing two filter objects to be combined with a logical *AND*. 
```{r convert-multi-expression} flt <- AnnotationFilter(~ symbol == "BCL2" & tx_biotype == "protein_coding") flt ``` Note that the `AnnotationFilter` function does not (yet) support translation of nested expressions, such as `(symbol == "BCL2L11" & tx_biotype == "nonsense_mediated_decay") | (symbol == "BCL2" & tx_biotype == "protein_coding")`. Such queries can however be build by nesting `AnnotationFilterList` classes. ```{r nested-query} ## Define the filter query for the first pair of filters. afl1 <- AnnotationFilterList(SymbolFilter("BCL2L11"), TxBiotypeFilter("nonsense_mediated_decay")) ## Define the second filter pair in ( brackets should be combined. afl2 <- AnnotationFilterList(SymbolFilter("BCL2"), TxBiotypeFilter("protein_coding")) ## Now combine both with a logical OR afl <- AnnotationFilterList(afl1, afl2, logicOp = "|") afl ``` This `AnnotationFilterList` would now select all entries for all transcripts of the gene *BCL2L11* with the biotype *nonsense_mediated_decay* or for all protein coding transcripts of the gene *BCL2*. # Using `AnnotationFilter` in other packages The `AnnotationFilter` package does only provide filter classes, but no filtering functionality. This has to be implemented in the package using the filters. In this section we first show in a very simple example how `AnnotationFilter` classes could be used to filter a `data.frame` and subsequently explore how a simple filter framework could be implemented for a SQL based annotation resources. Let's first define a simple `data.frame` containing the data we want to filter. Note that subsetting this `data.frame` using `AnnotationFilter` is obviously not the best solution, but it should help to understand the basic concept. 
```{r define-data.frame} ## Define a simple gene table gene <- data.frame(gene_id = 1:10, symbol = c(letters[1:9], "b"), seq_name = paste0("chr", c(1, 4, 4, 8, 1, 2, 5, 3, "X", 4)), stringsAsFactors = FALSE) gene ``` Next we generate a `SymbolFilter` and inspect what information we can extract from it. ```{r simple-symbol} smbl <- SymbolFilter("b") ``` We can access the filter *condition* using the `condition` method ```{r simple-symbol-condition} condition(smbl) ``` The value of the filter using the `value` method ```{r simple-symbol-value} value(smbl) ``` And finally the *field* (i.e. column in the data table) using the `field` method. ```{r simple-symbol-field} field(smbl) ``` With this information we can define a simple function that takes the data table and the filter as input and returns a `logical` with length equal to the number of rows of the table, `TRUE` for rows matching the filter. ```{r doMatch} doMatch <- function(x, filter) { do.call(condition(filter), list(x[, field(filter)], value(filter))) } ## Apply this function doMatch(gene, smbl) ``` Note that this simple function does not support multiple filters and also not conditions `"startsWith"` or `"endsWith"`. Next we define a second function that extracts the relevant data from the data resource. ```{r doExtract} doExtract <- function(x, filter) { x[doMatch(x, filter), ] } ## Apply it on the data doExtract(gene, smbl) ``` We could even modify the `doMatch` function to enable filter expressions. ```{r doMatch-formula} doMatch <- function(x, filter) { if (is(filter, "formula")) filter <- AnnotationFilter(filter) do.call(condition(filter), list(x[, field(filter)], value(filter))) } doExtract(gene, ~ gene_id == '2') ``` For such simple examples `AnnotationFilter` might be an overkill as the same could be achieved (much simpler) using standard R operations. A real case scenario in which `AnnotationFilter` becomes useful are SQL-based annotation resources. 
We will thus explore next how SQL resources could be filtered using
`AnnotationFilter`.

We use the SQLite database from the `r Biocpkg("org.Hs.eg.db")` package that
provides a variety of annotations for all human genes. Using the package's
connection to the database we first inspect which database tables are available
and then select one for our simple filtering example.

```{r orgDb, message = FALSE}
## Load the required packages
library(org.Hs.eg.db)
library(RSQLite)
## Get the database connection
dbcon <- org.Hs.eg_dbconn()

## What tables do we have?
dbListTables(dbcon)
```

`org.Hs.eg.db` provides many different tables, one for each identifier or
annotation resource. We will use the *gene_info* table and determine which
*fields* (i.e. columns) the table provides.

```{r gene_info}
## What fields are there in the gene_info table?
dbListFields(dbcon, "gene_info")
```

The *gene_info* table provides the official gene symbol and the gene name. The
column *symbol* matches the default `field` value of the `SymbolFilter`. For the
`GenenameFilter` we would have to re-map its default field `"genename"` to the
database column *gene_name*. There are many ways to do this; one would be to
implement a dedicated, database-specific function that extracts the field from
the `AnnotationFilter` classes and renames the extracted field value to match
the corresponding database column name.
We next implement a simple `doExtractGene` function that retrieves data from
the *gene_info* table and re-uses the `doExtract` function to extract specific
data. The parameter `x` is now the database connection object.

```{r doExtractSQL}
doExtractGene <- function(x, filter) {
    gene <- dbGetQuery(x, "select * from gene_info")
    doExtract(gene, filter)
}

## Extract all entries for BCL2
bcl2 <- doExtractGene(dbcon, SymbolFilter("BCL2"))

bcl2
```

This works, but is not really efficient, since the function first fetches the
full database table and subsets it only afterwards. A much more efficient
solution is to *translate* the `AnnotationFilter` class(es) to an SQL *where*
condition and hence perform the filtering on the database level. Here we have to
make some small modifications, since not all condition values can be used 1:1
in SQL calls. The condition `"=="`, for example, has to be converted into `"="`
and `"startsWith"` into an SQL `"like"`, with a `"%"` wildcard appended to the
value of the filter. We would also have to deal with filters that have a `value`
of length > 1. A `SymbolFilter` with a `value` of `c("BCL2", "BCL2L11")` would
for example have to be converted to the SQL condition
`"symbol in ('BCL2','BCL2L11')"`. Here we skip these special cases and define a
simple function that translates an `AnnotationFilter` to a *where* condition to
be included in the SQL call. Values of filters extending `CharacterFilter`
additionally have to be quoted, in contrast to those of filters extending
`IntegerFilter`.

```{r simpleSQL}
## Define a simple function that covers some condition conversion
conditionForSQL <- function(x) {
    switch(x,
           "==" = "=",
           x)
}

## Define a function to translate a filter into an SQL where condition.
## Character values have to be quoted.
where <- function(x) { if (is(x, "CharacterFilter")) value <- paste0("'", value(x), "'") else value <- value(x) paste0(field(x), conditionForSQL(condition(x)), value) } ## Now "translate" a filter using this function where(SeqNameFilter("Y")) ``` Next we implement a new function which integrates the filter into the SQL call to let the database server take care of the filtering. ```{r doExtractGene2} ## Define a function that doExtractGene2 <- function(x, filter) { if (is(filter, "formula")) filter <- AnnotationFilter(filter) query <- paste0("select * from gene_info where ", where(filter)) dbGetQuery(x, query) } bcl2 <- doExtractGene2(dbcon, ~ symbol == "BCL2") bcl2 ``` Below we compare the performance of both approaches. ```{r performance} system.time(doExtractGene(dbcon, ~ symbol == "BCL2")) system.time(doExtractGene2(dbcon, ~ symbol == "BCL2")) ``` Not surprisingly, the second approach is much faster. Be aware that the examples shown here are only for illustration purposes. In a real world situation additional factors, like combinations of filters, which database tables to join, which columns to be returned etc would have to be considered too. # Session information ```{r si} sessionInfo() ``` AnnotationFilter/inst/doc/AnnotationFilter.html0000644000175400017540000402764413175747732023011 0ustar00biocbuildbiocbuild Facilities for Filtering Bioconductor Annotation Resources

Package: AnnotationFilter
Authors: Martin Morgan [aut], Johannes Rainer [aut], Joachim Bargsten [ctb], Daniel Van Twisk [ctb], Bioconductor Maintainer [cre]
Last modified: 2017-10-30 17:37:05
Compiled: Mon Oct 30 21:20:21 2017

1 Introduction

A large variety of annotation resources are available in Bioconductor. Accessing the full content of these databases, or even of single tables, is computationally expensive and in many instances not required, as users may want to extract only subsets of the data, e.g. the genomic coordinates of a single gene. Filtering annotation resources before data extraction therefore has a major impact on performance and increases the usability of such genome-scale databases.

The AnnotationFilter package was thus developed to provide basic filter classes to enable a common filtering framework for Bioconductor annotation resources. AnnotationFilter defines filter classes for some of the most commonly used features in annotation databases, such as symbol or genename. Each filter class is supposed to work on a single database table column and to facilitate filtering on the provided values. Such filter classes enable the user to build complex queries to retrieve specific annotations without needing to know column or table names or the layout of the underlying databases. Although initially developed for use in the Organism.dplyr and ensembldb packages, the filter classes and the related filtering concept can easily be added to other annotation packages too.

2 Filter classes

All filter classes extend the basic AnnotationFilter class and take one or more values and a condition to allow filtering on a single database table column. Based on the type of the input value, filter classes are divided into:

  • CharacterFilter: takes a character value of length >= 1 and supports conditions ==, !=, startsWith and endsWith. An example would be a GeneIdFilter that allows filtering on gene IDs.

  • IntegerFilter: takes a single integer as input and supports the conditions ==, !=, >, <, >= and <=. An example would be a GeneStartFilter that filters results on the (chromosomal) start coordinates of genes.

  • GRangesFilter: is a special filter, as it takes a GRanges as value and performs the filtering on a combination of columns (i.e. start and end coordinate as well as sequence name and strand). To be consistent with the findOverlaps method from the IRanges package, the constructor of the GRangesFilter filter takes a type argument to define its condition. Supported values are "any" (the default) that retrieves all entries overlapping the GRanges, "start" and "end" matching all features with the same start and end coordinate respectively, "within" that matches all features that are within the range defined by the GRanges and "equal" that returns features that are equal to the GRanges.
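
The three kinds of filter described above could be constructed as follows. This is only a minimal sketch: the gene IDs, the start coordinate, the sequence name and the range are made up for illustration, and it assumes that the GenomicRanges package is installed alongside AnnotationFilter.

```r
library(AnnotationFilter)
library(GenomicRanges)

## CharacterFilter: one or more character values (illustrative IDs).
gid <- GeneIdFilter(c("BCL2", "BCL2L11"))

## IntegerFilter: a single integer, here combined with a ">" condition.
gst <- GeneStartFilter(10000L, condition = ">")

## GRangesFilter: the condition is set through the 'type' argument.
## The range below is an arbitrary region chosen for illustration.
rng <- GRanges("18", IRanges(start = 60000000, end = 61000000))
grf <- GRangesFilter(rng, type = "within")

## Inspect the condition of each filter.
condition(gid)
condition(gst)
condition(grf)
```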

The names of the filter classes are intuitive: the first part corresponds to the database column name, with the first character and each character following a _ being capitalized, followed by the key word Filter. A filter for a database table column gene_id is thus called GeneIdFilter. The default database column for a filter is stored in its field slot (accessible via the field method).
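
The naming rule can be sketched in a few lines of base R. The helper `fieldToFilterName` below is hypothetical and written only for illustration; it is not part of AnnotationFilter, and it does not reproduce exceptions to the rule such as `GRangesFilter` (field `granges`).

```r
## Hypothetical helper: derive a filter class name from a column name.
fieldToFilterName <- function(field) {
    ## Split the column name at each underscore.
    parts <- strsplit(field, "_", fixed = TRUE)[[1]]
    ## Capitalize the first character of every part.
    parts <- paste0(toupper(substring(parts, 1, 1)), substring(parts, 2))
    ## Concatenate the parts and append the key word "Filter".
    paste0(paste(parts, collapse = ""), "Filter")
}

fieldToFilterName("gene_id")
fieldToFilterName("tx_biotype")
```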

The supportedFilters method can be used to get an overview of all available filter objects defined in AnnotationFilter.

library(AnnotationFilter)
supportedFilters()
##               filter        field
## 16      CdsEndFilter      cds_end
## 15    CdsStartFilter    cds_start
## 6       EntrezFilter       entrez
## 19     ExonEndFilter     exon_end
## 1       ExonIdFilter      exon_id
## 2     ExonNameFilter    exon_name
## 18    ExonRankFilter    exon_rank
## 17   ExonStartFilter   exon_start
## 24     GRangesFilter      granges
## 5  GeneBiotypeFilter gene_biotype
## 21     GeneEndFilter     gene_end
## 3       GeneIdFilter      gene_id
## 20   GeneStartFilter   gene_start
## 4     GenenameFilter     genename
## 11   ProteinIdFilter   protein_id
## 13     SeqNameFilter     seq_name
## 14   SeqStrandFilter   seq_strand
## 7       SymbolFilter       symbol
## 10   TxBiotypeFilter   tx_biotype
## 23       TxEndFilter       tx_end
## 8         TxIdFilter        tx_id
## 9       TxNameFilter      tx_name
## 22     TxStartFilter     tx_start
## 12     UniprotFilter      uniprot

Note that the AnnotationFilter package provides only the filter classes, not the functionality to apply the filtering. Such functionality depends on the annotation resource and its database layout and thus needs to be implemented in the packages providing access to annotation resources.

3 Usage

Filters are created via their dedicated constructor functions, such as the GeneIdFilter function for the GeneIdFilter class. Because filters are simple and cheap to create, filter classes are intended to be read-only and thus don’t provide setter methods to change their slot values. In addition to the constructor functions, AnnotationFilter provides functionality to translate query expressions into filter objects (see further below for an example).

Below we create a SymbolFilter that could be used to filter an annotation resource to retrieve all entries associated with the specified symbol value(s).

library(AnnotationFilter)

smbl <- SymbolFilter("BCL2")
smbl
## class: SymbolFilter 
## condition: == 
## value: BCL2

Such a filter is supposed to be used to retrieve all entries associated with features whose value in a database table column called symbol matches the filter’s value "BCL2".

Using the "startsWith" condition we could define a filter to retrieve all entries for genes with a gene name/symbol starting with the specified value (e.g. "BCL2" and "BCL2L11" for the example below).

smbl <- SymbolFilter("BCL2", condition = "startsWith")
smbl
## class: SymbolFilter 
## condition: startsWith 
## value: BCL2
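
How a condition is evaluated is up to the package that applies the filter. As an illustration only, base R's `startsWith` function shows the intended semantics of the `"startsWith"` condition on a made-up vector of gene symbols:

```r
## Made-up symbols, for illustration only.
symbols <- c("BCL2", "BCL2L11", "BAX", "MCL1")

## Logical vector: which symbols start with "BCL2"?
hits <- startsWith(symbols, "BCL2")

## Keep only the matching symbols.
symbols[hits]
```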

In addition to the constructor functions, AnnotationFilter provides functionality to create filter instances in a more natural and intuitive way by translating filter expressions (written as a formula, i.e. starting with a ~).

smbl <- AnnotationFilter(~ symbol == "BCL2")
smbl
## class: SymbolFilter 
## condition: == 
## value: BCL2
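
A filter expression and the matching constructor call should describe the same filter. The sketch below compares the two routes for a filter with two values; the variable names are chosen for illustration, and the comparison relies on `%in%` being translated to the `"=="` condition, as in the package's translation table.

```r
library(AnnotationFilter)

## Route 1: translate a filter expression.
viaFormula <- AnnotationFilter(~ gene_id %in% c("BCL2", "BCL2L11"))

## Route 2: call the constructor directly.
viaConstructor <- GeneIdFilter(c("BCL2", "BCL2L11"))

## Compare class, field, condition and value of both objects.
class(viaFormula)
field(viaFormula) == field(viaConstructor)
condition(viaFormula) == condition(viaConstructor)
identical(value(viaFormula), value(viaConstructor))
```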

Individual AnnotationFilter objects can be combined in an AnnotationFilterList. This class extends list and provides an additional logicOp() that defines how its individual filters are supposed to be combined. The length of logicOp() has to be 1 less than the number of filter objects. Each element in logicOp() defines how two consecutive filters should be combined. Below we create an AnnotationFilterList containing two filter objects to be combined with a logical AND.

flt <- AnnotationFilter(~ symbol == "BCL2" &
                            tx_biotype == "protein_coding")
flt
## AnnotationFilterList of length 2 
## symbol == 'BCL2' & tx_biotype == 'protein_coding'
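
The relation between the number of filters and the length of logicOp() can also be seen when the list is built with the constructor. The following is a small sketch with three filters and hence two logical operators; the filter values and the name `afl3` are arbitrary.

```r
library(AnnotationFilter)

## Three filters need two operators: one between each consecutive pair.
afl3 <- AnnotationFilterList(SymbolFilter("BCL2"),
                             TxBiotypeFilter("protein_coding"),
                             SeqNameFilter("18"),
                             logicOp = c("&", "|"))

## Number of filters in the list.
length(afl3)

## Operators combining filter 1 with 2, and 2 with 3.
logicOp(afl3)
```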

Note that the AnnotationFilter function does not (yet) support translation of nested expressions, such as (symbol == "BCL2L11" & tx_biotype == "nonsense_mediated_decay") | (symbol == "BCL2" & tx_biotype == "protein_coding"). Such queries can however be built by nesting AnnotationFilterList objects.

## Define the filter query for the first pair of filters.
afl1 <- AnnotationFilterList(SymbolFilter("BCL2L11"),
                             TxBiotypeFilter("nonsense_mediated_decay"))
## Define the second pair of filters that should be combined (the second bracket).
afl2 <- AnnotationFilterList(SymbolFilter("BCL2"),
                             TxBiotypeFilter("protein_coding"))
## Now combine both with a logical OR
afl <- AnnotationFilterList(afl1, afl2, logicOp = "|")

afl
## AnnotationFilterList of length 2 
## (symbol == 'BCL2L11' & tx_biotype == 'nonsense_mediated_decay') | (symbol == 'BCL2' & tx_biotype == 'protein_coding')

This AnnotationFilterList would now select all entries for all transcripts of the gene BCL2L11 with the biotype nonsense_mediated_decay or for all protein coding transcripts of the gene BCL2.
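
To make the intended selection concrete, the same nested condition can be written in plain R against a small, made-up transcript table. This only illustrates the logic the nested list encodes; it does not use the filter objects themselves.

```r
## Made-up transcript table, for illustration only.
tx <- data.frame(
    symbol = c("BCL2L11", "BCL2L11", "BCL2", "BCL2"),
    tx_biotype = c("nonsense_mediated_decay", "protein_coding",
                   "protein_coding", "retained_intron"),
    stringsAsFactors = FALSE)

## (first pair combined with AND) OR (second pair combined with AND)
keep <- (tx$symbol == "BCL2L11" &
         tx$tx_biotype == "nonsense_mediated_decay") |
    (tx$symbol == "BCL2" & tx$tx_biotype == "protein_coding")

## Rows selected by the nested condition.
tx[keep, ]
```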

4 Using AnnotationFilter in other packages

The AnnotationFilter package provides only filter classes, but no filtering functionality. This has to be implemented in the package using the filters. In this section we first show in a very simple example how AnnotationFilter classes could be used to filter a data.frame and subsequently explore how a simple filter framework could be implemented for an SQL-based annotation resource.

Let’s first define a simple data.frame containing the data we want to filter. Note that subsetting this data.frame using AnnotationFilter is obviously not the best solution, but it should help to understand the basic concept.

## Define a simple gene table
gene <- data.frame(gene_id = 1:10,
                   symbol = c(letters[1:9], "b"),
                   seq_name = paste0("chr", c(1, 4, 4, 8, 1, 2, 5, 3, "X", 4)),
                   stringsAsFactors = FALSE)
gene
##    gene_id symbol seq_name
## 1        1      a     chr1
## 2        2      b     chr4
## 3        3      c     chr4
## 4        4      d     chr8
## 5        5      e     chr1
## 6        6      f     chr2
## 7        7      g     chr5
## 8        8      h     chr3
## 9        9      i     chrX
## 10      10      b     chr4

Next we generate a SymbolFilter and inspect what information we can extract from it.

smbl <- SymbolFilter("b")

We can access the filter condition using the condition method:

condition(smbl)
## [1] "=="

The value of the filter can be accessed using the value method:

value(smbl)
## [1] "b"

And finally the field (i.e. column in the data table) using the field method.

field(smbl)
## [1] "symbol"

With this information we can define a simple function that takes the data table and the filter as input and returns a logical with length equal to the number of rows of the table, TRUE for rows matching the filter.

doMatch <- function(x, filter) {
    do.call(condition(filter), list(x[, field(filter)], value(filter)))
}

## Apply this function
doMatch(gene, smbl)
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Note that this simple function supports neither multiple filters nor the conditions "startsWith" and "endsWith". Next we define a second function that extracts the relevant data from the data resource.

doExtract <- function(x, filter) {
    x[doMatch(x, filter), ]
}

## Apply it on the data
doExtract(gene, smbl)
##    gene_id symbol seq_name
## 2        2      b     chr4
## 10      10      b     chr4
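The restriction on conditions could be lifted along the following lines. This is only a sketch: doMatchExt is a hypothetical helper, not part of AnnotationFilter, and it ignores the not flag of the filter. It handles the pattern conditions explicitly and uses %in% so that filter values of length greater than one behave as expected.

```r
library(AnnotationFilter)

gene <- data.frame(gene_id = 1:10,
                   symbol = c(letters[1:9], "b"),
                   seq_name = paste0("chr", c(1, 4, 4, 8, 1, 2, 5, 3, "X", 4)),
                   stringsAsFactors = FALSE)

## Hypothetical helper: like doMatch, but with explicit handling of the
## character conditions and of values of length > 1.
doMatchExt <- function(x, filter) {
    vals <- x[, field(filter)]
    switch(condition(filter),
           "==" = vals %in% value(filter),
           "!=" = !(vals %in% value(filter)),
           startsWith = startsWith(as.character(vals), value(filter)),
           endsWith = endsWith(as.character(vals), value(filter)),
           contains = grepl(value(filter), vals, fixed = TRUE),
           ## Remaining conditions (">", "<", ">=", "<=") map to R operators.
           do.call(condition(filter), list(vals, value(filter))))
}

## All genes on a sequence whose name starts with "chr4"
gene[doMatchExt(gene, SeqNameFilter("chr4", condition = "startsWith")), ]
## All genes with symbol "a" or "b"
gene[doMatchExt(gene, SymbolFilter(c("a", "b"))), ]
```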

We could even modify the doMatch function to enable filter expressions.

doMatch <- function(x, filter) {
    if (is(filter, "formula"))
        filter <- AnnotationFilter(filter)
    do.call(condition(filter), list(x[, field(filter)], value(filter)))
}

doExtract(gene, ~ gene_id == '2')
##   gene_id symbol seq_name
## 2       2      b     chr4
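Support for several filters could be added in a similar way. The sketch below defines a hypothetical doMatchList function (not part of AnnotationFilter) that recurses into an AnnotationFilterList and combines the per-filter results with the operators returned by logicOp(). It makes two simplifying assumptions that a real implementation would have to revisit: the operators are applied strictly from left to right, and the not flag is ignored.

```r
library(methods)
library(AnnotationFilter)

gene <- data.frame(gene_id = 1:10,
                   symbol = c(letters[1:9], "b"),
                   seq_name = paste0("chr", c(1, 4, 4, 8, 1, 2, 5, 3, "X", 4)),
                   stringsAsFactors = FALSE)

## Hypothetical helper: evaluate a single filter, a filter list or a
## filter expression against a data.frame.
doMatchList <- function(x, filter) {
    if (is(filter, "formula"))
        filter <- AnnotationFilter(filter)
    ## A single filter is evaluated as in doMatch.
    if (!is(filter, "AnnotationFilterList"))
        return(do.call(condition(filter),
                       list(x[, field(filter)], value(filter))))
    ## A filter list: evaluate the first element, then fold in the
    ## remaining ones from left to right. Assumes a non-empty list.
    res <- doMatchList(x, filter[[1]])
    ops <- logicOp(filter)
    for (i in seq_along(ops))
        res <- do.call(ops[i], list(res, doMatchList(x, filter[[i + 1]])))
    res
}

gene[doMatchList(gene, ~ symbol == "b" & seq_name == "chr4"), ]
gene[doMatchList(gene, ~ symbol == "a" | symbol == "i"), ]
```

Since nested AnnotationFilterList objects are handled by the recursion, bracketed groups built by hand are evaluated group by group.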

For such simple examples AnnotationFilter might be overkill, as the same could be achieved much more simply using standard R operations. A real-world scenario in which AnnotationFilter becomes useful is filtering SQL-based annotation resources. We will thus explore next how SQL resources could be filtered using AnnotationFilter.

We use the SQLite database from the org.Hs.eg.db package, which provides a variety of annotations for all human genes. Using the package's connection to the database, we first inspect which database tables are available and then select one for our simple filtering example.

## Load the required packages
library(org.Hs.eg.db)
library(RSQLite)
## Get the database connection
dbcon <- org.Hs.eg_dbconn()

## What tables do we have?
dbListTables(dbcon)
##  [1] "accessions"            "alias"                 "chrlengths"           
##  [4] "chromosome_locations"  "chromosomes"           "cytogenetic_locations"
##  [7] "ec"                    "ensembl"               "ensembl2ncbi"         
## [10] "ensembl_prot"          "ensembl_trans"         "gene_info"            
## [13] "genes"                 "go"                    "go_all"               
## [16] "go_bp"                 "go_bp_all"             "go_cc"                
## [19] "go_cc_all"             "go_mf"                 "go_mf_all"            
## [22] "kegg"                  "map_counts"            "map_metadata"         
## [25] "metadata"              "ncbi2ensembl"          "omim"                 
## [28] "pfam"                  "prosite"               "pubmed"               
## [31] "refseq"                "sqlite_stat1"          "sqlite_stat4"         
## [34] "ucsc"                  "unigene"               "uniprot"

org.Hs.eg.db provides many different tables, one for each identifier or annotation resource. We will use the gene_info table and determine which fields (i.e. columns) the table provides.

## What fields are there in the gene_info table?
dbListFields(dbcon, "gene_info")
## [1] "_id"       "gene_name" "symbol"

The gene_info table provides the official gene symbol and the gene name. The column symbol matches the default field value of the SymbolFilter. For the GenenameFilter we would have to re-map its default field "genename" to the database column gene_name. There are many ways to do this; one would be to implement a dedicated function, specific to the database, that extracts the field from the AnnotationFilter classes. Where necessary, this function renames the extracted field value to match the name of the corresponding database column.
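Such a re-mapping function could look as follows. This is a minimal sketch; dbField and its lookup table are hypothetical names introduced here for illustration and are not provided by AnnotationFilter or org.Hs.eg.db.

```r
library(AnnotationFilter)

## Hypothetical helper: return the database column name for a filter.
## Fields without an entry in the lookup table are returned unchanged.
dbField <- function(filter) {
    map <- c(genename = "gene_name")
    fld <- field(filter)
    if (fld %in% names(map))
        map[[fld]]
    else fld
}

dbField(GenenameFilter("BCL2"))
dbField(SymbolFilter("BCL2"))
```

Keeping the mapping in one place means that the filter classes themselves stay untouched and only the database-specific code knows about the column names.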

We next implement a simple doExtractGene function that retrieves data from the gene_info table and re-uses the doExtract function to extract specific data. The parameter x is now the database connection object.

doExtractGene <- function(x, filter) {
    gene <- dbGetQuery(x, "select * from gene_info")
    doExtract(gene, filter)
}

## Extract all entries for BCL2
bcl2 <- doExtractGene(dbcon, SymbolFilter("BCL2"))

bcl2
##     _id                 gene_name symbol
## 487 487 BCL2, apoptosis regulator   BCL2

This works, but is not really efficient, since the function first fetches the full database table and subsets it only afterwards. A much more efficient solution is to translate the AnnotationFilter class(es) into an SQL where condition and hence perform the filtering on the database level. Here we have to make some small modifications, since not all condition values can be used one-to-one in SQL calls. The condition "==", for example, has to be converted into "=", and "startsWith" into an SQL "like", with a "%" wildcard appended to the value of the filter. We would also have to deal with filters that have a value of length > 1. A SymbolFilter with the value c("BCL2", "BCL2L11") would, for example, have to be converted into the SQL condition "symbol in ('BCL2','BCL2L11')". Here we skip these special cases and define a simple function that translates an AnnotationFilter into a where condition to be included in the SQL call. The value also has to be quoted if the filter extends CharacterFilter, but not if it extends IntegerFilter.

## Define a simple function that covers some condition conversion
conditionForSQL <- function(x) {
    switch(x,
           "==" = "=",
           x)
}

## Define a function to translate a filter into an SQL where condition.
## Character values have to be quoted.
where <- function(x) {
    if (is(x, "CharacterFilter"))
        value <- paste0("'", value(x), "'")
    else value <- value(x)
    paste0(field(x), conditionForSQL(condition(x)), value)
}

## Now "translate" a filter using this function
where(SeqNameFilter("Y"))
## [1] "seq_name='Y'"
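For completeness, the special cases skipped above could be covered roughly as follows. whereSQL is a hypothetical helper written for this illustration. It does not escape quotes or wildcard characters contained in the filter values, and the choice to wrap the value in "%" on both sides for "contains" is an assumption of this sketch.

```r
library(methods)
library(AnnotationFilter)

## Hypothetical helper: translate a single filter into an SQL where
## condition, including pattern conditions and multiple values.
whereSQL <- function(x) {
    val <- value(x)
    cond <- condition(x)
    ## Pattern conditions become an SQL "like" with "%" wildcards.
    if (cond %in% c("startsWith", "endsWith", "contains")) {
        val <- switch(cond,
                      startsWith = paste0(val, "%"),
                      endsWith = paste0("%", val),
                      contains = paste0("%", val, "%"))
        cond <- "like"
    }
    ## Character values have to be quoted.
    if (is(x, "CharacterFilter"))
        val <- paste0("'", val, "'")
    if (length(val) > 1) {
        ## Multiple values are only meaningful for "==" and "!=".
        cond <- switch(cond,
                       "==" = "in",
                       "!=" = "not in",
                       stop("condition '", cond,
                            "' does not support multiple values"))
        val <- paste0("(", paste(val, collapse = ","), ")")
    } else if (cond == "==") {
        cond <- "="
    }
    paste(field(x), cond, val)
}

whereSQL(SymbolFilter(c("BCL2", "BCL2L11")))
whereSQL(SymbolFilter("BCL2", condition = "startsWith"))
whereSQL(SeqNameFilter("Y"))
```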

Next we implement a new function which integrates the filter into the SQL call to let the database server take care of the filtering.

## Define a function that performs the filtering in the SQL query
doExtractGene2 <- function(x, filter) {
    if (is(filter, "formula"))
        filter <- AnnotationFilter(filter)
    query <- paste0("select * from gene_info where ", where(filter))
    dbGetQuery(x, query)
}

bcl2 <- doExtractGene2(dbcon, ~ symbol == "BCL2")
bcl2
##   _id                 gene_name symbol
## 1 487 BCL2, apoptosis regulator   BCL2
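An alternative to pasting the quoted value into the query string is to let the database driver bind it as a parameter. The sketch below illustrates this with the params argument of dbGetQuery. To keep it independent of org.Hs.eg.db it uses a small in-memory SQLite table whose rows are made up for this example; only the table and column names follow gene_info.

```r
library(DBI)
library(RSQLite)
library(AnnotationFilter)

## Small in-memory stand-in for the gene_info table (made-up rows).
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "gene_info",
             data.frame(symbol = c("BCL2", "BCL2L11", "ADA"),
                        gene_name = c("gene A", "gene B", "gene C"),
                        stringsAsFactors = FALSE))

flt <- SymbolFilter("BCL2")
## The column name is taken from the filter; the value is bound to the
## "?" placeholder by the driver instead of being pasted into the query.
query <- paste0("select * from gene_info where ", field(flt), " = ?")
res <- dbGetQuery(con, query, params = list(value(flt)))
res

dbDisconnect(con)
```

With bound parameters the value does not need to be quoted by hand, which avoids problems with values that themselves contain quote characters.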

Below we compare the performance of both approaches.

system.time(doExtractGene(dbcon, ~ symbol == "BCL2"))
##    user  system elapsed 
##   0.096   0.000   0.095
system.time(doExtractGene2(dbcon, ~ symbol == "BCL2"))
##    user  system elapsed 
##   0.012   0.000   0.011

Not surprisingly, the second approach is much faster.

Be aware that the examples shown here are for illustration purposes only. In a real-world situation additional factors, such as combinations of filters, which database tables to join, which columns to return, etc., would have to be considered too.
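As one example of such an additional factor, combinations of filters could be translated by recursing into an AnnotationFilterList. The following is a sketch under simplifying assumptions: whereAll is a hypothetical helper, the single-filter translation is the simple where function from above (repeated so that the example runs on its own), the not flag is ignored, and nested lists are wrapped in brackets. Note that within a flat list the database applies its own operator precedence, with "and" binding more tightly than "or".

```r
library(methods)
library(AnnotationFilter)

## Single-filter translation as defined earlier in this section.
where <- function(x) {
    if (is(x, "CharacterFilter"))
        value <- paste0("'", value(x), "'")
    else value <- value(x)
    cond <- switch(condition(x), "==" = "=", condition(x))
    paste0(field(x), cond, value)
}

## Hypothetical helper: translate a filter, filter list or filter
## expression into a where condition. Assumes a non-empty list.
whereAll <- function(x) {
    if (is(x, "formula"))
        x <- AnnotationFilter(x)
    if (!is(x, "AnnotationFilterList"))
        return(where(x))
    parts <- vapply(seq_along(x), function(i) {
        el <- x[[i]]
        res <- whereAll(el)
        ## Keep nested groups together by bracketing them.
        if (is(el, "AnnotationFilterList"))
            res <- paste0("(", res, ")")
        res
    }, character(1))
    ops <- unname(c("&" = "and", "|" = "or")[logicOp(x)])
    res <- parts[1]
    for (i in seq_along(ops))
        res <- paste(res, ops[i], parts[i + 1])
    res
}

afl <- AnnotationFilterList(
    AnnotationFilterList(SymbolFilter("BCL2L11"),
                         TxBiotypeFilter("nonsense_mediated_decay")),
    AnnotationFilterList(SymbolFilter("BCL2"),
                         TxBiotypeFilter("protein_coding")),
    logicOp = "|")
whereAll(afl)
whereAll(~ symbol == "BCL2" & seq_name == "18")
```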

5 Session information

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] RSQLite_2.0            org.Hs.eg.db_3.4.2     AnnotationDbi_1.40.0  
## [4] IRanges_2.12.0         S4Vectors_0.16.0       Biobase_2.38.0        
## [7] BiocGenerics_0.24.0    AnnotationFilter_1.2.0 BiocStyle_2.6.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13            knitr_1.17              XVector_0.18.0         
##  [4] magrittr_1.5            GenomicRanges_1.30.0    zlibbioc_1.24.0        
##  [7] bit_1.1-12              rlang_0.1.2             blob_1.1.0             
## [10] stringr_1.2.0           GenomeInfoDb_1.14.0     tools_3.4.2            
## [13] DBI_0.7                 htmltools_0.3.6         bit64_0.9-7            
## [16] lazyeval_0.2.1          yaml_2.1.14             rprojroot_1.2          
## [19] digest_0.6.12           tibble_1.3.4            bookdown_0.5           
## [22] GenomeInfoDbData_0.99.1 bitops_1.0-6            RCurl_1.95-4.8         
## [25] memoise_1.1.0           evaluate_0.10.1         rmarkdown_1.6          
## [28] stringi_1.1.5           compiler_3.4.2          backports_1.1.1        
## [31] pkgconfig_2.0.1
AnnotationFilter/man/0000755000175400017540000000000013175715601015641 5ustar00biocbuildbiocbuildAnnotationFilter/man/AnnotationFilter.Rd0000644000175400017540000002141613175715601021414 0ustar00biocbuildbiocbuild% Generated by roxygen2: do not edit by hand % Please edit documentation in R/AnnotationFilter.R, R/translate-utils.R \docType{methods} \name{AnnotationFilter} \alias{AnnotationFilter} \alias{CdsStartFilter} \alias{CdsEndFilter} \alias{ExonIdFilter} \alias{ExonNameFilter} \alias{ExonStartFilter} \alias{ExonEndFilter} \alias{ExonRankFilter} \alias{GeneIdFilter} \alias{GenenameFilter} \alias{GeneBiotypeFilter} \alias{GeneStartFilter} \alias{GeneEndFilter} \alias{EntrezFilter} \alias{SymbolFilter} \alias{TxIdFilter} \alias{TxNameFilter} \alias{TxBiotypeFilter} \alias{TxStartFilter} \alias{TxEndFilter} \alias{ProteinIdFilter} \alias{UniprotFilter} \alias{SeqNameFilter} \alias{SeqStrandFilter} \alias{AnnotationFilter-class} \alias{CharacterFilter-class} \alias{IntegerFilter-class} \alias{CdsStartFilter-class} \alias{CdsEndFilter-class} \alias{ExonIdFilter-class} \alias{ExonNameFilter-class} \alias{ExonStartFilter-class} \alias{ExonEndFilter-class} \alias{ExonRankFilter-class} \alias{GeneIdFilter-class} \alias{GenenameFilter-class} \alias{GeneBiotypeFilter-class} \alias{GeneStartFilter-class} \alias{GeneEndFilter-class} \alias{EntrezFilter-class} \alias{SymbolFilter-class} \alias{TxIdFilter-class} \alias{TxNameFilter-class} \alias{TxBiotypeFilter-class} \alias{TxStartFilter-class} \alias{TxEndFilter-class} \alias{ProteinIdFilter-class} \alias{UniprotFilter-class} \alias{SeqNameFilter-class} \alias{SeqStrandFilter-class} \alias{supportedFilters} \alias{show,AnnotationFilter-method} \alias{show,CharacterFilter-method} \alias{show,IntegerFilter-method} \alias{show,GRangesFilter-method} \alias{condition,AnnotationFilter-method} \alias{condition} \alias{value,AnnotationFilter-method} \alias{value} \alias{field,AnnotationFilter-method} \alias{field} 
\alias{not,AnnotationFilter-method} \alias{GRangesFilter-class} \alias{.GRangesFilter} \alias{GRangesFilter} \alias{feature} \alias{AnnotationFilter} \alias{convertFilter,AnnotationFilter,missing-method} \alias{supportedFilters,missing-method} \alias{AnnotationFilter} \title{Filters for annotation objects} \usage{ CdsStartFilter(value, condition = "==", not = FALSE) CdsEndFilter(value, condition = "==", not = FALSE) ExonIdFilter(value, condition = "==", not = FALSE) ExonNameFilter(value, condition = "==", not = FALSE) ExonRankFilter(value, condition = "==", not = FALSE) ExonStartFilter(value, condition = "==", not = FALSE) ExonEndFilter(value, condition = "==", not = FALSE) GeneIdFilter(value, condition = "==", not = FALSE) GenenameFilter(value, condition = "==", not = FALSE) GeneBiotypeFilter(value, condition = "==", not = FALSE) GeneStartFilter(value, condition = "==", not = FALSE) GeneEndFilter(value, condition = "==", not = FALSE) EntrezFilter(value, condition = "==", not = FALSE) SymbolFilter(value, condition = "==", not = FALSE) TxIdFilter(value, condition = "==", not = FALSE) TxNameFilter(value, condition = "==", not = FALSE) TxBiotypeFilter(value, condition = "==", not = FALSE) TxStartFilter(value, condition = "==", not = FALSE) TxEndFilter(value, condition = "==", not = FALSE) ProteinIdFilter(value, condition = "==", not = FALSE) UniprotFilter(value, condition = "==", not = FALSE) SeqNameFilter(value, condition = "==", not = FALSE) SeqStrandFilter(value, condition = "==", not = FALSE) \S4method{condition}{AnnotationFilter}(object) \S4method{value}{AnnotationFilter}(object) \S4method{field}{AnnotationFilter}(object) \S4method{not}{AnnotationFilter}(object) GRangesFilter(value, feature = "gene", type = c("any", "start", "end", "within", "equal")) feature(object) \S4method{convertFilter}{AnnotationFilter,missing}(object) \S4method{supportedFilters}{missing}(object) AnnotationFilter(expr) } \arguments{ \item{object}{An \code{AnnotationFilter} object.} 
\item{value}{\code{character()}, \code{integer()}, or \code{GRanges()} value for the filter} \item{feature}{\code{character(1)} defining on what feature the \code{GRangesFilter} should be applied. Choices could be \code{"gene"}, \code{"tx"} or \code{"exon"}.} \item{type}{\code{character(1)} indicating how overlaps are to be filtered. See \code{findOverlaps} in the IRanges package for a description of this argument.} \item{expr}{A filter expression, written as a \code{formula}, to be converted to an \code{AnnotationFilter} or \code{AnnotationFilterList} class. See below for examples.} \item{condition}{\code{character(1)} defining the condition to be used in the filter. For \code{IntegerFilter}, one of \code{"=="}, \code{"!="}, \code{">"}, \code{"<"}, \code{">="} or \code{"<="}. For \code{CharacterFilter}, one of \code{"=="}, \code{"!="}, \code{"startsWith"}, \code{"endsWith"} or \code{"contains"}. Default condition is \code{"=="}.} \item{not}{\code{logical(1)} whether the \code{AnnotationFilter} is negated. \code{TRUE} indicates that the filter is negated (!). \code{FALSE} indicates that it is not negated. Default not is \code{FALSE}.} } \value{ The constructor functions return an object extending \code{AnnotationFilter}. For the return value of the other methods see the methods' descriptions. \code{convertFilter} returns a \code{character(1)} that can be used as input to a \code{dplyr} filter. \code{AnnotationFilter} returns an \code{\link{AnnotationFilter}} or an \code{\link{AnnotationFilterList}}. } \description{ The filters extending the base \code{AnnotationFilter} class represent a simple filtering concept for annotation resources. Each filter object is intended to filter on a single (database) table column using the provided values and the defined condition. Filter instances are created using the constructor functions (e.g. \code{GeneIdFilter}). \code{supportedFilters()} lists all defined filters. It returns a two-column \code{data.frame} with the filter class name and its default field. 
Packages using \code{AnnotationFilter} should implement the \code{supportedFilters} method for their annotation resource object (e.g. for \code{object = "EnsDb"} in the \code{ensembldb} package) to list all supported filters for the specific resource. \code{condition()} gets the \code{condition} value for the filter \code{object}. \code{value()} gets the \code{value} for the filter \code{object}. \code{field()} gets the \code{field} for the filter \code{object}. \code{not()} gets the \code{not} for the filter \code{object}. \code{feature()} gets the \code{feature} for the \code{GRangesFilter} \code{object}. \code{convertFilter()} converts an \code{AnnotationFilter} object to a \code{character(1)} giving an equation that can be used as input to a \code{dplyr} filter. \code{AnnotationFilter} \emph{translates} a filter expression such as \code{~ gene_id == "BCL2"} into a filter object extending the \code{\link{AnnotationFilter}} class (in the example a \code{\link{GeneIdFilter}} object) or an \code{\link{AnnotationFilterList}} if the expression contains multiple conditions (see examples below). Filter expressions have to be written in the form \code{~ <field> <condition> <value>}, with \code{<field>} being the default field of the filter class (use the \code{supportedFilters} function to list all fields and filter classes), \code{<condition>} the logical expression and \code{<value>} the value for the filter. } \details{ By default filters are only available for tables containing the field on which the filter acts (i.e. that contain a column with the name matching the value of the \code{field} slot of the object). See the vignette for a description of how to use filters for databases in which the database table column name differs from the default \code{field} of the filter. Filter expressions for the \code{AnnotationFilter} class have to be written as formulas, i.e. starting with a \code{~}. } \note{ Translation of nested filter expressions using the \code{AnnotationFilter} function is not yet supported. 
} \examples{ ## filter by GRanges GRangesFilter(GenomicRanges::GRanges("chr10:87869000-87876000")) ## Create a SymbolFilter to filter on a gene's symbol. sf <- SymbolFilter("BCL2") sf ## Create a GeneStartFilter to filter based on the genes' chromosomal start ## coordinates gsf <- GeneStartFilter(10000, condition = ">") gsf filter <- SymbolFilter("ADA", "==") result <- convertFilter(filter) result supportedFilters() ## Convert a filter expression based on a gene ID to a GeneIdFilter gnf <- AnnotationFilter(~ gene_id == "BCL2") gnf ## Same conversion but for two gene IDs. gnf <- AnnotationFilter(~ gene_id \%in\% c("BCL2", "BCL2L11")) gnf ## Converting an expression that combines multiple filters. As a result we ## get an AnnotationFilterList containing the corresponding filters. ## Be aware that nesting of expressions/filters does not work. flt <- AnnotationFilter(~ gene_id \%in\% c("BCL2", "BCL2L11") & tx_biotype == "nonsense_mediated_decay" | seq_name == "Y") flt } \seealso{ \code{\link{AnnotationFilterList}} for combining \code{AnnotationFilter} objects. 
} AnnotationFilter/man/AnnotationFilterList.Rd0000644000175400017540000001212613175715601022246 0ustar00biocbuildbiocbuild% Generated by roxygen2: do not edit by hand % Please edit documentation in R/AnnotationFilterList.R \docType{methods} \name{AnnotationFilterList} \alias{AnnotationFilterList} \alias{AnnotationFilterList-class} \alias{AnnotationFilterList} \alias{value,AnnotationFilterList-method} \alias{logicOp,AnnotationFilterList-method} \alias{logicOp} \alias{not,AnnotationFilterList-method} \alias{not} \alias{distributeNegation,AnnotationFilterList-method} \alias{distributeNegation} \alias{convertFilter,AnnotationFilterList,missing-method} \alias{convertFilter} \alias{show,AnnotationFilterList-method} \title{Combining annotation filters} \usage{ AnnotationFilterList(..., logicOp = character(), logOp = character(), not = FALSE, .groupingFlag = FALSE) \S4method{value}{AnnotationFilterList}(object) \S4method{logicOp}{AnnotationFilterList}(object) \S4method{not}{AnnotationFilterList}(object) \S4method{distributeNegation}{AnnotationFilterList}(object, .prior_negation = FALSE) \S4method{convertFilter}{AnnotationFilterList,missing}(object) \S4method{show}{AnnotationFilterList}(object) } \arguments{ \item{...}{individual \code{\link{AnnotationFilter}} objects or a mixture of \code{AnnotationFilter} and \code{AnnotationFilterList} objects.} \item{logicOp}{\code{character} of length equal to the number of submitted \code{AnnotationFilter} objects - 1. Each value representing the logical operation to combine consecutive filters, i.e. the first element being the logical operation to combine the first and second \code{AnnotationFilter}, the second element being the logical operation to combine the second and third \code{AnnotationFilter} and so on. Allowed values are \code{"&"} and \code{"|"}. The function assumes a logical \emph{and} between all elements by default.} \item{logOp}{Deprecated; use \code{logicOp=}.} \item{not}{\code{logical} of length one. 
Indicates whether the grouping of \code{AnnotationFilters} is to be negated.} \item{.groupingFlag}{Flag designated for internal use only.} \item{object}{An object of class \code{AnnotationFilterList}.} \item{.prior_negation}{\code{logical(1)} unused argument.} } \value{ \code{AnnotationFilterList} returns an \code{AnnotationFilterList}. \code{value()} returns a \code{list} with \code{AnnotationFilter} objects. \code{logicOp()} returns a \code{character()} vector of \dQuote{&} or \dQuote{|} symbols. \code{not()} returns a \code{logical(1)} indicating whether the \code{AnnotationFilterList} is negated. \code{distributeNegation()} returns an \code{AnnotationFilterList} object with De Morgan's law applied to it such that it is equal to the original \code{AnnotationFilterList} object but all \code{!}'s are distributed out of the \code{AnnotationFilterList} object and to the nested \code{AnnotationFilter} objects. \code{convertFilter()} returns a \code{character(1)} that can be used as input to a \code{dplyr} filter. } \description{ The \code{AnnotationFilterList} allows combining filter objects extending the \code{\link{AnnotationFilter}} class to construct more complex queries. Consecutive filter objects in the \code{AnnotationFilterList} can be combined by a logical \emph{and} (\code{&}) or \emph{or} (\code{|}). The \code{AnnotationFilterList} extends \code{list}; individual elements can thus be accessed with \code{[[}. \code{value()} gets a \code{list} with the \code{AnnotationFilter} objects. Use \code{[[} to access individual filters. \code{logicOp()} gets the logical operators separating successive \code{AnnotationFilter} objects. \code{not()} gets whether the \code{AnnotationFilterList} is negated. \code{convertFilter()} converts an \code{AnnotationFilterList} object to a \code{character(1)} giving an equation that can be used as input to a \code{dplyr} filter. } \note{ The \code{AnnotationFilterList} does not support empty elements, hence all elements of \code{length == 0} are removed in the constructor function. 
} \examples{ ## Create some AnnotationFilters gf <- GenenameFilter(c("BCL2", "BCL2L11")) tbtf <- TxBiotypeFilter("protein_coding", condition = "!=") ## Combine both to an AnnotationFilterList. By default elements are combined ## using a logical "and" operator. The filter list represents thus a query ## like: get all features where the gene name is either ("BCL2" or "BCL2L11") ## and the transcript biotype is not "protein_coding". afl <- AnnotationFilterList(gf, tbtf) afl ## Access individual filters. afl[[1]] ## Create a filter in the form of: get all features where the gene name is ## either ("BCL2" or "BCL2L11") and the transcript biotype is not ## "protein_coding" or the seq_name is "Y". Hence, this will get all feature ## also found by the previous AnnotationFilterList and returns also all ## features on chromosome Y. afl <- AnnotationFilterList(gf, tbtf, SeqNameFilter("Y"), logicOp = c("&", "|")) afl afl <- AnnotationFilter(~!(symbol == 'ADA' | symbol \%startsWith\% 'SNORD')) afl <- distributeNegation(afl) afl afl <- AnnotationFilter(~symbol=="ADA" & tx_start > "400000") result <- convertFilter(afl) result } \seealso{ \code{\link{supportedFilters}} for available \code{\link{AnnotationFilter}} objects } AnnotationFilter/tests/0000755000175400017540000000000013175715601016230 5ustar00biocbuildbiocbuildAnnotationFilter/tests/testthat/0000755000175400017540000000000013175715601020070 5ustar00biocbuildbiocbuildAnnotationFilter/tests/testthat.R0000644000175400017540000000011413175715601020207 0ustar00biocbuildbiocbuildlibrary(testthat) library(AnnotationFilter) test_check("AnnotationFilter") AnnotationFilter/tests/testthat/test_AnnotationFilter.R0000644000175400017540000001152513175715601024536 0ustar00biocbuildbiocbuildcontext("AnnotationFilter") test_that("supportedFilters() works", { expect_true(inherits(supportedFilters(), "data.frame")) expect_identical( nrow(supportedFilters()), length(unlist(AnnotationFilter:::.FIELD, use.names=FALSE)) + 
length(AnnotationFilter:::.FILTERS_WO_FIELD) ) }) test_that("SymbolFilter as representative for character filters", { expect_true(validObject(new("SymbolFilter"))) expect_error(SymbolFilter()) expect_error(SymbolFilter(1, ">")) expect_error(SymbolFilter(1, "foo")) expect_error(SymbolFilter(c("foo","bar"), "startsWith")) ## Getter / setter fl <- SymbolFilter("BCL2") expect_equal(value(fl), "BCL2") fl <- SymbolFilter(c(4, 5)) expect_equal(value(fl), c("4", "5")) fl <- SymbolFilter(3) expect_equal(value(fl), "3") expect_error(SymbolFilter(NA)) ## condition. expect_equal(condition(fl), "==") fl <- SymbolFilter("a", condition = "!=") expect_equal(condition(fl), "!=") expect_error(SymbolFilter("a", condition = "<")) expect_error(SymbolFilter("a", condition = "")) expect_error(SymbolFilter("a", condition = c("==", ">"))) expect_error(SymbolFilter("a", condition = NULL)) expect_error(SymbolFilter("a", condition = NA)) expect_error(SymbolFilter("a", condition = 4)) }) test_that("GeneStartFilter as representative for integer filters", { gsf <- GeneStartFilter(10000, condition = ">") expect_equal(condition(gsf), ">") expect_error(GeneStartFilter("3")) expect_error(GeneStartFilter("B")) expect_error(GeneStartFilter(NA)) expect_error(GeneStartFilter(NULL)) expect_error(GeneStartFilter()) ## Condition expect_error(GeneStartFilter(10000, condition = "startsWith")) expect_error(GeneStartFilter(10000, condition = "endsWith")) expect_error(GeneStartFilter(10000, condition = c("==", "<"))) }) test_that("GRangesFilter works", { GRanges <- GenomicRanges::GRanges grf <- GRangesFilter(GRanges("chr10:87869000-87876000")) expect_equal(condition(grf), "any") expect_error(GRangesFilter(value = 3)) expect_error(GRangesFilter( GRanges("chr10:87869000-87876000"), type = "==" )) grf <- GRangesFilter( GRanges("chr10:87869000-87876000"), type = "within", feature = "tx" ) expect_equal(condition(grf), "within") expect_equal(feature(grf), "tx") }) test_that("fieldToClass works", { 
expect_identical(AnnotationFilter:::.fieldToClass("gene_id"), "GeneIdFilter") ## Support replacement for multiple _ : issue #13 expect_identical(AnnotationFilter:::.fieldToClass("gene_seq_start"), "GeneSeqStartFilter") }) test_that("convertFilter Works", { expect_identical(convertFilter(SymbolFilter("ADA")), "symbol == 'ADA'") expect_identical(convertFilter(SymbolFilter("ADA", "!=")), "symbol != 'ADA'") expect_identical(convertFilter(SymbolFilter("ADA", "startsWith")), "symbol %like% 'ADA%'") expect_identical(convertFilter(SymbolFilter("ADA", "endsWith")), "symbol %like% '%ADA'") expect_identical(convertFilter(SymbolFilter("ADA", "contains")), "symbol %like% 'ADA'") expect_identical(convertFilter(TxStartFilter(1000)), "tx_start == '1000'") expect_identical(convertFilter(TxStartFilter(1000, "!=")), "tx_start != '1000'") expect_identical(convertFilter(TxStartFilter(1000, ">")), "tx_start > 1000") expect_identical(convertFilter(TxStartFilter(1000, "<")), "tx_start < 1000") expect_identical(convertFilter(TxStartFilter(1000, ">=")), "tx_start >= 1000") expect_identical(convertFilter(TxStartFilter(1000, "<=")), "tx_start <= 1000") ## check NOT works expect_identical(convertFilter(SymbolFilter("ADA", not=TRUE)), "!symbol == 'ADA'") expect_identical(convertFilter(SymbolFilter("ADA", "!=", not=TRUE)), "!symbol != 'ADA'") expect_identical(convertFilter(SymbolFilter("ADA", "startsWith", not=TRUE)), "!symbol %like% 'ADA%'") expect_identical(convertFilter(SymbolFilter("ADA", "endsWith", not=TRUE)), "!symbol %like% '%ADA'") expect_identical(convertFilter(SymbolFilter("ADA", "contains", not=TRUE)), "!symbol %like% 'ADA'") expect_identical(convertFilter(TxStartFilter(1000, not=TRUE)), "!tx_start == '1000'") expect_identical(convertFilter(TxStartFilter(1000, "!=", not=TRUE)), "!tx_start != '1000'") expect_identical(convertFilter(TxStartFilter(1000, ">", not=TRUE)), "!tx_start > 1000") expect_identical(convertFilter(TxStartFilter(1000, "<", not=TRUE)), "!tx_start < 1000") 
expect_identical(convertFilter(TxStartFilter(1000, ">=", not=TRUE)), "!tx_start >= 1000") expect_identical(convertFilter(TxStartFilter(1000, "<=", not=TRUE)), "!tx_start <= 1000") }) AnnotationFilter/tests/testthat/test_AnnotationFilterList.R0000644000175400017540000000536013175715601025372 0ustar00biocbuildbiocbuildcontext("AnnotationFilterList") test_that("AnnotationFilterList() works", { f1 <- GeneIdFilter("somegene") f2 <- SeqNameFilter("chr3") f3 <- GeneBiotypeFilter("protein_coding", "!=") fL <- AnnotationFilter:::AnnotationFilterList(f1, f2) expect_true(length(fL) == 2) expect_equal(fL[[1]], f1) expect_equal(fL[[2]], f2) expect_true(all(logicOp(fL) == "&")) fL <- AnnotationFilter:::AnnotationFilterList(f1, f2, f3, logicOp = c("&", "|")) expect_true(length(fL) == 3) expect_equal(fL[[1]], f1) expect_equal(fL[[2]], f2) expect_equal(fL[[3]], f3) expect_equal(logicOp(fL), c("&", "|")) ## A AnnotationFilterList with and AnnotationFilterList fL <- AnnotationFilter:::AnnotationFilterList(f1, f2, logicOp = "|") fL2 <- AnnotationFilter:::AnnotationFilterList(f3, fL, logicOp = "&") expect_true(length(fL) == 2) expect_true(length(fL2) == 2) expect_true(is(value(fL2)[[1]], "GeneBiotypeFilter")) expect_true(is(value(fL2)[[2]], "AnnotationFilterList")) expect_equal(value(fL2)[[2]], fL) expect_equal(fL2[[2]], fL) expect_equal(logicOp(fL2), "&") expect_equal(logicOp(fL2[[2]]), "|") }) test_that("empty elements in AnnotationFilterList", { ## empty elements should be removed from the AnnotationFilterList. empty_afl <- AnnotationFilterList() afl <- AnnotationFilterList(empty_afl) expect_true(length(afl) == 0) afl <- AnnotationFilterList(GeneIdFilter(4), empty_afl) expect_true(length(afl) == 1) afl <- AnnotationFilterList(GeneIdFilter(4), AnnotationFilter(~ gene_id == 3 | seq_name == 4),empty_afl) expect_true(length(afl) == 2) ## Check validate. afl@.Data <- c(afl@.Data, list(empty_afl)) ## Fix also the logOp. 
afl@logOp <- c(afl@logOp, "|") expect_error(validObject(afl)) }) test_that("convertFilter works", { smbl <- SymbolFilter("ADA") txid <- TxIdFilter(1000) gr <- GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121")) expect_identical(convertFilter(AnnotationFilter(~smbl | txid)), "symbol == 'ADA' | tx_id == '1000'") expect_identical(convertFilter(AnnotationFilter(~smbl & (smbl | txid))), "symbol == 'ADA' & (symbol == 'ADA' | tx_id == '1000')") expect_identical(convertFilter(AnnotationFilter(~smbl & !(smbl | txid))), "symbol == 'ADA' & !(symbol == 'ADA' | tx_id == '1000')") expect_error(convertFilter(AnnotationFilter(smbl | (txid & gr)))) }) test_that("distributeNegation works", { afl <- AnnotationFilter(~!(symbol == 'ADA' | symbol %startsWith% 'SNORD')) afl2 <- AnnotationFilter(~!symbol == 'ADA' & !symbol %startsWith% 'SNORD') expect_identical(distributeNegation(afl), afl2) }) AnnotationFilter/tests/testthat/test_translate-utils.R0000644000175400017540000001136613175715601024414 0ustar00biocbuildbiocbuildcontext("expression translation") test_that("translation of expression works for single filter/condition", { ## Check for some character filter. 
## exon_id flt <- ExonIdFilter("EX1", condition = "==") flt2 <- AnnotationFilter(~ exon_id == "EX1") expect_equal(flt, flt2) flt <- ExonIdFilter(c("EX1", "EX2"), condition = "!=") flt2 <- AnnotationFilter(~ exon_id != c("EX1", "EX2")) expect_equal(flt, flt2) ## seq_name flt <- SeqNameFilter(c("chr3", "chrX"), condition = "==") flt2 <- AnnotationFilter(~ seq_name == c("chr3", "chrX")) expect_equal(flt, flt2) flt <- SeqNameFilter(1:3, condition = "==") flt2 <- AnnotationFilter(~ seq_name %in% 1:3) expect_equal(flt, flt2) ## Check IntegerFilter flt <- GeneStartFilter(123, condition = ">") flt2 <- AnnotationFilter(~ gene_start > 123) expect_equal(flt, flt2) flt <- TxStartFilter(123, condition = "<") flt2 <- AnnotationFilter(~ tx_start < 123) expect_equal(flt, flt2) flt <- GeneEndFilter(123, condition = ">=") flt2 <- AnnotationFilter(~ gene_end >= 123) expect_equal(flt, flt2) flt <- ExonEndFilter(123, condition = "<=") flt2 <- AnnotationFilter(~ exon_end <= 123) expect_equal(flt, flt2) ## Test exceptions/errors. expect_error(AnnotationFilter(~ not_existing == 1:3)) ## Throws an error, but is not self-explanatory. expect_error(AnnotationFilter(~ gene_id * 3)) }) test_that("translation of combined expressions works", { res <- AnnotationFilter(~ exon_id == "EX1" & genename == "BCL2") cmp <- AnnotationFilterList(ExonIdFilter("EX1"), GenenameFilter("BCL2")) expect_equal(res, cmp) res <- AnnotationFilter(~ exon_id == "EX1" | genename != "BCL2") cmp <- AnnotationFilterList(ExonIdFilter("EX1"), GenenameFilter("BCL2", "!="), logicOp = "|") expect_equal(res, cmp) ## 3 filters. res <- AnnotationFilter(~ exon_id == "EX1" & genename == "BCL2" | seq_name != 3) ## Expect an AnnotationFilterList of length 3. expect_equal(length(res), 3) cmp <- AnnotationFilterList(ExonIdFilter("EX1"), GenenameFilter("BCL2"), SeqNameFilter(3, "!="), logicOp = c("&", "|")) expect_equal(res, cmp) ## 4 filters. 
res <- AnnotationFilter(~ exon_id == "EX1" & genename == "BCL2" | seq_name != 3 | seq_name == "Y") expect_equal(length(res), 4) cmp <- AnnotationFilterList(ExonIdFilter("EX1"), GenenameFilter("BCL2"), SeqNameFilter(3, "!="), SeqNameFilter("Y"), logicOp = c("&", "|", "|")) expect_equal(res, cmp) }) test_that("translation works from within other functions", { simpleFun <- function(x) AnnotationFilter(x) expect_equal(simpleFun(~ gene_id == 4), AnnotationFilter(~ gene_id == 4)) filter_expr <- ~ gene_id == 4 expect_equal(simpleFun(filter_expr), AnnotationFilter(~ gene_id == 4)) }) ## This might be a test if we get the nesting working. ## test_that("translation of nested expressions works" { ## res <- convertFilterExpression((exon_id == "EX1" & gene_id == "BCL2") | ## (exon_id == "EX3" & gene_id == "BCL2L11")) ## expect_equal(logicOp(res), "|") ## expect_true(is(res[[1]], "AnnotationFilterList")) ## expect_equal(res[[1]][[1]], ExonIdFilter("EX1")) ## expect_equal(res[[1]][[2]], GeneIdFilter("BCL2")) ## expect_equal(logicOp(res[[1]]), "&") ## expect_true(is(res[[2]], "AnnotationFilterList")) ## expect_equal(res[[2]][[1]], ExonIdFilter("EX3")) ## expect_equal(res[[2]][[2]], GeneIdFilter("BCL2L11")) ## expect_equal(logicOp(res[[2]]), "&") ## ## ## res <- convertFilterExpression(seq_name == "Y" | ## (exon_id == "EX1" & gene_id == "BCL2") & ## (exon_id == "EX3" & gene_id == "BCL2L11")) ## ## Expect: length 3, first being a SeqNameFilter, second an ## ## AnnotationFilterList, third a AnnotationFilterList. 
## expect_equal(res[[1]], SeqNameFilter("Y")) ## expect_equal(logicOp(res), "|") ## expect_true(is(res[[2]], "AnnotationFilterList")) ## expect_equal(res[[1]][[1]], ExonIdFilter("EX1")) ## expect_equal(res[[1]][[2]], GeneIdFilter("BCL2")) ## expect_equal(logicOp(res[[1]]), "&") ## expect_true(is(res[[2]], "AnnotationFilterList")) ## expect_equal(res[[2]][[1]], ExonIdFilter("EX3")) ## expect_equal(res[[2]][[2]], GeneIdFilter("BCL2L11")) ## expect_equal(logicOp(res[[2]]), "&") ## expect_true(is(res[[1]], "AnnotationFilterList")) ## expect_true(is(res[[2]], "AnnotationFilterList")) ## convertFilterExpression((gene_id == 3) () ## }) AnnotationFilter/vignettes/0000755000175400017540000000000013175747732017110 5ustar00biocbuildbiocbuildAnnotationFilter/vignettes/AnnotationFilter.Rmd0000644000175400017540000003560413175715601023032 0ustar00biocbuildbiocbuild--- title: "Facilities for Filtering Bioconductor Annotation Resources" output: BiocStyle::html_document2: toc_float: true vignette: > %\VignetteIndexEntry{Facilities for Filtering Bioconductor Annotation resources} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignettePackage{AnnotationFilter} %\VignetteDepends{org.Hs.eg.db,BiocStyle,RSQLite} --- ```{r style, echo = FALSE, results = 'asis', message=FALSE} BiocStyle::markdown() ``` **Package**: `r Biocpkg("AnnotationFilter")`
**Authors**: `r packageDescription("AnnotationFilter")[["Author"]] `
**Last modified:** `r file.info("AnnotationFilter.Rmd")$mtime`
**Compiled**: `r date()`

# Introduction

A large variety of annotation resources are available in Bioconductor. Accessing the full content of these databases, or even of single tables, is computationally expensive and in many instances not required, as users may want to extract only subsets of the data, e.g. the genomic coordinates of a single gene. In that respect, filtering annotation resources before data extraction has a major impact on performance and increases the usability of such genome-scale databases.

The `r Biocpkg("AnnotationFilter")` package was thus developed to provide basic filter classes to enable a common filtering framework for Bioconductor annotation resources. `r Biocpkg("AnnotationFilter")` defines filter classes for some of the most commonly used features in annotation databases, such as *symbol* or *genename*. Each filter class is supposed to work on a single database table column and to facilitate filtering on the provided values. Such filter classes enable the user to build complex queries to retrieve specific annotations without needing to know column or table names or the layout of the underlying databases. While initially developed for use in the `r Biocpkg("Organism.dplyr")` and `r Biocpkg("ensembldb")` packages, the filter classes and the related filtering concept can easily be added to other annotation packages too.

# Filter classes

All filter classes extend the basic `AnnotationFilter` class and take one or more *values* and a *condition* to allow filtering on a single database table column. Based on the type of the input value, filter classes are divided into:

- `CharacterFilter`: takes a `character` value of length >= 1 and supports conditions `==`, `!=`, `startsWith` and `endsWith`. An example would be a `GeneIdFilter` that allows filtering on gene IDs.

- `IntegerFilter`: takes a single `integer` as input and supports the conditions `==`, `!=`, `>`, `<`, `>=` and `<=`.
An example would be a `GeneStartFilter` that filters results on the (chromosomal) start coordinates of genes.

- `GRangesFilter`: is a special filter, as it takes a `GRanges` as `value` and performs the filtering on a combination of columns (i.e. start and end coordinate as well as sequence name and strand). To be consistent with the `findOverlaps` method from the `r Biocpkg("IRanges")` package, the constructor of the `GRangesFilter` filter takes a `type` argument to define its condition. Supported values are `"any"` (the default), which retrieves all entries overlapping the `GRanges`; `"start"` and `"end"`, which match all features with the same start and end coordinate respectively; `"within"`, which matches all features that are *within* the range defined by the `GRanges`; and `"equal"`, which returns features that are equal to the `GRanges`.

The names of the filter classes are intuitive: the first part corresponds to the database column name, with the first character and each character following a `_` capitalized and the `_` dropped, followed by the keyword `Filter`. The filter for a database table column `gene_id` is thus called `GeneIdFilter`. The default database column for a filter is stored in its `field` slot (accessible *via* the `field` method).

The `supportedFilters` method can be used to get an overview of all available filter objects defined in `AnnotationFilter`.

```{r supportedFilters}
library(AnnotationFilter)
supportedFilters()
```

Note that the `AnnotationFilter` package provides only the filter classes, not the functionality to apply the filtering. Such functionality depends on the annotation resource and its database layout and thus needs to be implemented in the packages providing access to annotation resources.

# Usage

Filters are created *via* their dedicated constructor functions, such as the `GeneIdFilter` function for the `GeneIdFilter` class.
Because of this simple and cheap creation, filter classes are intended to be *read-only* and thus don't provide *setter* methods to change their slot values. In addition to the constructor functions, `AnnotationFilter` provides the functionality to *translate* query expressions into filter classes (see further below for an example).

Below we create a `SymbolFilter` that could be used to filter an annotation resource to retrieve all entries associated with the specified symbol value(s).

```{r symbol-filter}
library(AnnotationFilter)

smbl <- SymbolFilter("BCL2")
smbl
```

Such a filter is supposed to be used to retrieve all entries associated with features that have a value in a database table column called *symbol* matching the filter's value `"BCL2"`.

Using the `"startsWith"` condition we could define a filter to retrieve all entries for genes with a gene name/symbol starting with the specified value (e.g. `"BCL2"` and `"BCL2L11"` for the example below).

```{r symbol-startsWith}
smbl <- SymbolFilter("BCL2", condition = "startsWith")
smbl
```

In addition to the constructor functions, `AnnotationFilter` provides functionality to create filter instances in a more natural and intuitive way by *translating* filter expressions (written as a *formula*, i.e. starting with a `~`).

```{r convert-expression}
smbl <- AnnotationFilter(~ symbol == "BCL2")
smbl
```

Individual `AnnotationFilter` objects can be combined in an `AnnotationFilterList`. This class extends `list` and provides an additional `logicOp()` that defines how its individual filters are supposed to be combined. The length of `logicOp()` has to be 1 less than the number of filter objects. Each element in `logicOp()` defines how two consecutive filters should be combined. Below we create an `AnnotationFilterList` containing two filter objects to be combined with a logical *AND*.
```{r convert-multi-expression}
flt <- AnnotationFilter(~ symbol == "BCL2" &
                            tx_biotype == "protein_coding")
flt
```

Note that the `AnnotationFilter` function does not (yet) support translation of nested expressions, such as `(symbol == "BCL2L11" & tx_biotype == "nonsense_mediated_decay") | (symbol == "BCL2" & tx_biotype == "protein_coding")`. Such queries can however be built by nesting `AnnotationFilterList` classes.

```{r nested-query}
## Define the filter query for the first pair of filters.
afl1 <- AnnotationFilterList(SymbolFilter("BCL2L11"),
                             TxBiotypeFilter("nonsense_mediated_decay"))
## Define the second filter pair; the filters within each pair of
## brackets are combined with a logical AND (the default).
afl2 <- AnnotationFilterList(SymbolFilter("BCL2"),
                             TxBiotypeFilter("protein_coding"))
## Now combine both with a logical OR
afl <- AnnotationFilterList(afl1, afl2, logicOp = "|")

afl
```

This `AnnotationFilterList` would now select all entries for all transcripts of the gene *BCL2L11* with the biotype *nonsense_mediated_decay* or for all protein coding transcripts of the gene *BCL2*.

# Using `AnnotationFilter` in other packages

The `AnnotationFilter` package provides only filter classes, but no filtering functionality. This has to be implemented in the package using the filters. In this section we first show in a very simple example how `AnnotationFilter` classes could be used to filter a `data.frame` and subsequently explore how a simple filter framework could be implemented for SQL-based annotation resources.

Let's first define a simple `data.frame` containing the data we want to filter. Note that subsetting this `data.frame` using `AnnotationFilter` is obviously not the best solution, but it should help to understand the basic concept.
```{r define-data.frame}
## Define a simple gene table
gene <- data.frame(gene_id = 1:10,
                   symbol = c(letters[1:9], "b"),
                   seq_name = paste0("chr", c(1, 4, 4, 8, 1, 2, 5, 3, "X", 4)),
                   stringsAsFactors = FALSE)
gene
```

Next we generate a `SymbolFilter` and inspect what information we can extract from it.

```{r simple-symbol}
smbl <- SymbolFilter("b")
```

We can access the filter *condition* using the `condition` method

```{r simple-symbol-condition}
condition(smbl)
```

The value of the filter using the `value` method

```{r simple-symbol-value}
value(smbl)
```

And finally the *field* (i.e. column in the data table) using the `field` method.

```{r simple-symbol-field}
field(smbl)
```

With this information we can define a simple function that takes the data table and the filter as input and returns a `logical` with length equal to the number of rows of the table, `TRUE` for rows matching the filter.

```{r doMatch}
doMatch <- function(x, filter) {
    do.call(condition(filter), list(x[, field(filter)], value(filter)))
}

## Apply this function
doMatch(gene, smbl)
```

Note that this simple function supports neither multiple filters nor the conditions `"startsWith"` or `"endsWith"`. Next we define a second function that extracts the relevant data from the data resource.

```{r doExtract}
doExtract <- function(x, filter) {
    x[doMatch(x, filter), ]
}

## Apply it on the data
doExtract(gene, smbl)
```

We could even modify the `doMatch` function to enable filter expressions.

```{r doMatch-formula}
doMatch <- function(x, filter) {
    if (is(filter, "formula"))
        filter <- AnnotationFilter(filter)
    do.call(condition(filter), list(x[, field(filter)], value(filter)))
}

doExtract(gene, ~ gene_id == '2')
```

For such simple examples `AnnotationFilter` might be overkill, as the same could be achieved (much more simply) using standard R operations. A real-world scenario in which `AnnotationFilter` becomes useful is filtering SQL-based annotation resources.
We will thus explore next how SQL resources could be filtered using `AnnotationFilter`.

We use the SQLite database from the `r Biocpkg("org.Hs.eg.db")` package that provides a variety of annotations for all human genes. Using the package's connection to the database we first inspect what database tables are available and then select one for our simple filtering example.

```{r orgDb, message = FALSE}
## Load the required packages
library(org.Hs.eg.db)
library(RSQLite)
## Get the database connection
dbcon <- org.Hs.eg_dbconn()

## What tables do we have?
dbListTables(dbcon)
```

`org.Hs.eg.db` provides many different tables, one for each identifier or annotation resource. We will use the *gene_info* table and determine which *fields* (i.e. columns) the table provides.

```{r gene_info}
## What fields are there in the gene_info table?
dbListFields(dbcon, "gene_info")
```

The *gene_info* table provides the official gene symbol and the gene name. The column *symbol* matches the default `field` value of the `SymbolFilter`. For the `GenenameFilter` we would have to re-map its default field `"genename"` to the database column *gene_name*. There are many ways to do this; one would be to implement a dedicated, database-specific function that extracts the field from the `AnnotationFilter` classes and renames the extracted value to match the name of the corresponding database column.
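
The re-mapping described above could look roughly like the following minimal sketch. The lookup table `fieldToColumn` and the helper `dbField` are hypothetical names introduced only for illustration; they are not part of `AnnotationFilter`. The sketch assumes that the only field needing translation is `"genename"`, as stated above for the *gene_info* table.

```r
library(AnnotationFilter)

## Hypothetical lookup table (not part of AnnotationFilter): maps a
## filter's default field to the column name used in the database.
fieldToColumn <- c(genename = "gene_name")

## Hypothetical helper: return the database column name for a filter.
## Fields without an entry in the lookup table are returned unchanged.
dbField <- function(filter, map = fieldToColumn) {
    fld <- field(filter)
    if (fld %in% names(map))
        unname(map[fld])
    else
        fld
}

## The field of a GenenameFilter is re-mapped ...
dbField(GenenameFilter("BCL2"))
## ... while the field of a SymbolFilter is passed through as is.
dbField(SymbolFilter("BCL2"))
```

A function building the SQL query could then call such a helper instead of `field()` directly, which keeps the database-specific naming in a single place.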
We next implement a simple `doExtractGene` function that retrieves data from the *gene_info* table and re-uses the `doExtract` function to extract specific data. The parameter `x` is now the database connection object.

```{r doExtractSQL}
doExtractGene <- function(x, filter) {
    gene <- dbGetQuery(x, "select * from gene_info")
    doExtract(gene, filter)
}

## Extract all entries for BCL2
bcl2 <- doExtractGene(dbcon, SymbolFilter("BCL2"))

bcl2
```

This works, but is not really efficient, since the function first fetches the full database table and subsets it only afterwards. A much more efficient solution is to *translate* the `AnnotationFilter` class(es) to an SQL *where* condition and hence perform the filtering on the database level. Here we have to make some small modifications, since not all condition values can be used 1:1 in SQL calls. The condition `"=="` has for example to be converted into `"="` and the `"startsWith"` into an SQL `"like"`, also adding a `"%"` wildcard to the value of the filter. We would also have to deal with filters that have a `value` of length > 1. A `SymbolFilter` with a `value` of `c("BCL2", "BCL2L11")` would for example have to be converted to an SQL call `"symbol in ('BCL2','BCL2L11')"`. Here we skip these special cases and define a simple function that translates an `AnnotationFilter` to a *where* condition to be included in the SQL call. Depending on whether the filter extends `CharacterFilter` or `IntegerFilter`, the value also has to be quoted.

```{r simpleSQL}
## Define a simple function that covers some condition conversion
conditionForSQL <- function(x) {
    switch(x,
           "==" = "=",
           x)
}

## Define a function to translate a filter into an SQL where condition.
## Character values have to be quoted.
where <- function(x) {
    if (is(x, "CharacterFilter"))
        value <- paste0("'", value(x), "'")
    else value <- value(x)
    paste0(field(x), conditionForSQL(condition(x)), value)
}

## Now "translate" a filter using this function
where(SeqNameFilter("Y"))
```

Next we implement a new function which integrates the filter into the SQL call to let the database server take care of the filtering.

```{r doExtractGene2}
## Define a function that performs the filtering within the SQL query
doExtractGene2 <- function(x, filter) {
    if (is(filter, "formula"))
        filter <- AnnotationFilter(filter)
    query <- paste0("select * from gene_info where ", where(filter))
    dbGetQuery(x, query)
}

bcl2 <- doExtractGene2(dbcon, ~ symbol == "BCL2")
bcl2
```

Below we compare the performance of both approaches.

```{r performance}
system.time(doExtractGene(dbcon, ~ symbol == "BCL2"))

system.time(doExtractGene2(dbcon, ~ symbol == "BCL2"))
```

Not surprisingly, the second approach is much faster.

Be aware that the examples shown here are only for illustration purposes. In a real-world situation additional factors, such as combinations of filters, which database tables to join and which columns to return, would have to be considered too.

# Session information

```{r si}
sessionInfo()
```