OrganismDbi/DESCRIPTION

Package: OrganismDbi
Title: Software to enable the smooth interfacing of different database
        packages
Description: The package enables a simple unified interface to several
        annotation packages, each of which has its own schema, by taking
        advantage of the fact that each of these packages implements a
        select method.
Version: 1.36.0
Encoding: UTF-8
Author: Marc Carlson, Hervé Pagès, Martin Morgan, Valerie Obenchain
Maintainer: Bioconductor Package Maintainer
Depends: R (>= 2.14.0), methods, BiocGenerics (>= 0.15.10),
        AnnotationDbi (>= 1.33.15), GenomicFeatures (>= 1.39.4)
Imports: Biobase, BiocManager, GenomicRanges (>= 1.31.13), graph,
        IRanges, RBGL, DBI, S4Vectors (>= 0.9.25), stats
Suggests: Homo.sapiens, Rattus.norvegicus, BSgenome.Hsapiens.UCSC.hg19,
        AnnotationHub, FDb.UCSC.tRNAs, mirbase.db, rtracklayer,
        biomaRt, RUnit, RMariaDB
Collate: AllGenerics.R AllClasses.R methods-select.R
        methods-transcripts.R createOrganismPackage.R seqinfo.R
        test_OrganismDbi_package.R
License: Artistic-2.0
biocViews: Annotation, Infrastructure
git_url: https://git.bioconductor.org/packages/OrganismDbi
git_branch: RELEASE_3_14
git_last_commit: 3e7a90d
git_last_commit_date: 2021-10-26
Date/Publication: 2021-10-26
NeedsCompilation: no
Packaged: 2021-10-26 22:10:36 UTC; biocbuild

OrganismDbi/NAMESPACE

import(methods)
import(BiocGenerics)
import(AnnotationDbi)
import(GenomicFeatures)
import(S4Vectors)
import(IRanges)

importClassesFrom("graph", graphNEL)
importClassesFrom("GenomicRanges", "GenomicRanges")

importMethodsFrom("graph", edgeNames, edges, nodes, show, subGraph, union)
importMethodsFrom("RBGL", bfs)
importMethodsFrom("DBI", dbGetQuery)

importFrom("Biobase", createPackage)
importFrom("BiocManager", repositories)
importFrom("graph", ftM2graphNEL)
importFrom("stats", setNames)
importFrom("utils", available.packages, contrib.url)

exportClasses("OrganismDb")

exportMethods(columns, keys, keytypes, select, mapIds,
              transcripts, exons, cds, genes,
              transcriptsBy, exonsBy, cdsBy,
              dbconn, dbfile, taxonomyId, metadata, seqinfo,
              getTxDbIfAvailable,
              TxDb,      ## formal getter
              "TxDb<-",  ## formal setter
              ## Below are methods that we are 'just wrapping' from TxDbs
              ## transcriptsByOverlaps,
              ## exonsByOverlaps,
              ## cdsByOverlaps,
              promoters, disjointExons, microRNAs, tRNAs,
              intronsByTranscript, fiveUTRsByTranscript,
              threeUTRsByTranscript, extractUpstreamSeqs,
              isActiveSeq, "isActiveSeq<-",
              asBED, asGFF, distance, mapToTranscripts,
              selectByRanges, selectRangesById)

export(getTxDbIfAvailable,  ## b/c the generic is here
       TxDb,      ## formal getter
       "TxDb<-",  ## formal setter
       resources,
       ## saveDb,
       selectByRanges, selectRangesById,
       makeOrganismPackage,
       makeOrganismDbFromTxDb,
       makeOrganismDbFromUCSC,
       makeOrganismDbFromBiomart)

## checkUsagePackage("OrganismDbi",all=TRUE)

OrganismDbi/NEWS

CHANGES IN VERSION 1.32.0
-------------------------

BUG FIXES

    o (v. 1.31.1) Load OrganismDb objects even if not on search path.
      See https://support.bioconductor.org/p/134141/

CHANGES IN VERSION 1.18.0
-------------------------

BUG FIXES

    o avoid duplicate factor levels during compression of metadata for
      cdsBy and friends; previously this incorrectly introduced empty
      metadata

CHANGES IN VERSION 1.16.0
-------------------------

NEW FEATURES

    o add check for missing OrgDb package in .taxIdToOrgDb()

    o add 'orgdb' argument to makeOrganismDbFromBiomart()

MODIFICATIONS

    o modify error message in .taxIdToOrgDb()

CHANGES IN VERSION 1.14.0
-------------------------

MODIFICATIONS

    o replace www.biomart.org with www.ensembl.org

    o import 'mcols', 'mcols<-' from S4Vectors

    o follow name change for GenomicFeatures:::.set_group_names()

    o add biomaRt, rtracklayer to 'Suggests'; used in unit tests/man pages

    o elementLengths was renamed -> elementNROWS in S4Vectors

    o replace require() with requireNamespace()

    o adjustments in response to the 'vals' -> 'filter' renaming in
      GenomicFeatures

    o update unit tests to reflect new PFAM data

    o load RSQLite in unit tests; no longer free from
      AnnotationDbi::dbFileConnect

    o use newly exported functions from AnnotationDbi related to select()
      and building annotation packages

CHANGES IN VERSION 1.10.0
-------------------------

NEW FEATURES

    o OrganismDb objects now have methods for most reasonable TxDb
      methods. If you can do it with a TxDb, it should also work for an
      OrganismDb.

    o Supports dbconn() and dbfile()

    o Exports getTxDbIfAvailable() for those who may want convenient
      access to the underlying TxDb object.

BUG FIXES

    o use.names arguments now work with exonsBy and related methods.

CHANGES IN VERSION 1.4.0
------------------------

NEW FEATURES

    o genes method now works on these objects

BUG FIXES

    o safer argument handling for 'by' arguments on accessors that use it.

CHANGES IN VERSION 1.4.0
------------------------

NEW FEATURES

    o keys method now has new arguments to allow for more sophisticated
      filtering.

BUG FIXES

    o cols arguments and methods will now be columns arguments and methods

    o stricter argument checking for select and friends

CHANGES IN VERSION 1.2.0
------------------------

NEW FEATURES

    o There are now range-based accessors (transcripts, exons, cds) that
      can be used to get data out of OrganismDb objects.

    o There are now list-oriented range-based accessors (transcriptsBy,
      exonsBy, cdsBy) that can be used to get data out of OrganismDb
      objects.

    o All these new range-based operations allow users to extract
      additional metadata via the mcols slot of the resulting object. The
      extra data is specified via the columns argument to these methods.

SIGNIFICANT USER-VISIBLE CHANGES

    o New range-based accessors allow you to get data back organized by
      range. Please see the associated manual pages and vignette for more
      details.

BUG FIXES

    o Show method for these objects is improved.

    o Documentation, tests and vignette updates for new methods.

CHANGES IN VERSION 1.0.0
------------------------

NEW FEATURES

    o OrganismDbi is for the creation and support of a new class of
      super-annotation packages.

    o 'Stock' OrganismDbi packages are available for human, mouse and rat
      with the 2.11 release of Bioconductor.

    o OrganismDbi has functionality for generating these packages.

    o OrganismDbi has methods that allow these packages to function.

    o OrganismDbi works by combining resources from annotation packages
      that implement the four methods needed for a "select" style
      interface. The essence of this is detailed in the AnnotationDbi
      package, along with base classes etc.
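The "select" style interface described above can be illustrated with the
stock Homo.sapiens package (a minimal sketch, assuming Homo.sapiens from
'Suggests' is installed; the columns shown are examples only):

    library(Homo.sapiens)
    keytypes(Homo.sapiens)    # pooled keytypes of GO.db, org.Hs.eg.db, TxDb
    columns(Homo.sapiens)     # pooled columns of the underlying databases
    ids <- head(keys(Homo.sapiens, keytype="ENTREZID"))
    select(Homo.sapiens, keys=ids,
           columns=c("SYMBOL", "GOID", "TXNAME"),
           keytype="ENTREZID")  # one call joins across the databases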
OrganismDbi/R/AllClasses.R

## general functions go here:

## Still need to define my object in AnnotationDbi

###############################################################################
## MultiDb class Objects

## What is the big idea?
## A: A generic container linking several AnnotationDb objects
##
## These objects will contain a list of resources along with the primary
## (package name and hence the name of the object that gets made when that
## package loads) and foreign keys for each resource. The foreign key is to
## be listed as a keytype. The basic notion is that these objects will allow
## us to write select() methods etc. that join the contents of multiple
## databases. The relevant select, cols, keys and keytypes methods should all
## call the base methods for each of the relevant packages, and then cat them
## together. Results should come back to the users, and the object itself
## ought to be able to call select on each of the tables, and then merge it
## to the rest based on the foreign keytype etc.

## For OrganismDb:
## So let's start with an object where we have a slot for an OrgDb and also a
## slot for a GODb, then let's add in a slot for a TxDb. Since
## AnnotationDbi does not know about TxDbs, I will have to define the
## base class here (with the aim of creating a new software package to hold
## this stuff later on).

## require(AnnotationDbi)
## require(GenomicFeatures)

## I need an initialize method just to allow me to do things like
## require(GO.db) etc.

## class union so that I can have TxDb OR a null value
setClassUnion("TxDbORNULL", c("TxDb", "NULL"))

## Original class
MultiDb <- setClass("MultiDb",
    representation(keys="matrix",
                   graph="graphNEL",
                   txdbSlot="TxDbORNULL",  ## optional slot
                   resources="character")
)

OrganismDb <- setClass("OrganismDb", contains="MultiDb")

.cantFindResourceMsg <- function(pkg){
    (paste0("The '", pkg, "' object is required but not ",
            "found. Please either install it (if it's from ",
            "a package) or ensure that it's loaded to the ",
            "search path."))
}

## helper for extracting pkgs and cols as a vector
.extractPkgsAndCols <- function(gd){
    setNames(as.vector(gd[,3:4]), as.vector(gd[,1:2]))
}

## Constructors (not intended to be called by the end user as it takes a
## very specific graphInfo object) (conceptually, the graphInfo is a
## seed object).
## Eventually, we will only use this internally (and it will have a
## different name like 'genericMultiDb'). And specific containers that
## do more for end users and take less specific inputs will be exposed
## to end users.
MultiDb <- function(dbType=NULL, graphInfo, ns=NULL, ...){
    ## make graphData into a graphNEL
    ## TODO: validate graphData -- required columns?
    gd <- graphInfo$graphData
    ## Then make actual graph
    graph <- ftM2graphNEL(gd[,1:2], edgemode="undirected")
    ## ## We should try to call require on all the supporting packages.
## pkgs <- unique(names(.extractPkgsAndCols(gd))) ## for (pkg in pkgs) ## .initPkg(pkg, dbType, ns=ns) ## Then call loadDb on all unloaded resources resources <- graphInfo$resources txdb <- NULL ## default value is NULL for(i in seq_along(resources)){ name <- names(resources[i]) if(!nzchar(resources[i]) && exists(name)){ obj <- get(name) }else if(!nzchar(resources[i]) && !exists(name)){ ## library(name, character.only=TRUE) (should be loaded by deps/user) obj <- get(name, envir=loadNamespace(name)) }else{ obj <- loadDb(resources[i]) assign(name,value=obj) } if(class(obj)=='TxDb'){txdb <- obj} ## stash it if it's a TxDb } ## Then make the object. new("MultiDb", ..., keys=gd, graph=graph, txdbSlot=txdb, resources=resources) } ## TODO: add to this so that we populate the TxDb slot (which we also need to add) OrganismDb <- function(dbType=NULL, graphInfo, ns=NULL, ...){ mdb <- MultiDb(dbType=NULL, graphInfo, ns=NULL, ...) new("OrganismDb", mdb, ...) } ########################################################### ## Convenience function that will load a package. ## IOW, there will be a call to this in zzz.R .loadOrganismDbiPkg <- function(pkgname, graphInfo){ ns <- asNamespace(pkgname) ## No longer any need for rules about where to find things... ## .addLocalPkgsToNamespace(pkgname, graphInfo, ns) obj <- OrganismDb(pkgname, graphInfo, ns) assign(pkgname, obj, envir=ns) namespaceExport(ns, pkgname) } ## Some getter methods to access the slots setGeneric("keyFrame", function(x) standardGeneric("keyFrame")) setMethod("keyFrame", "MultiDb", function(x){x@keys} ) setGeneric("dbGraph", function(x) standardGeneric("dbGraph")) setMethod("dbGraph", "MultiDb", function(x){x@graph} ) ## Then some helpers to process some of these results a bit .getDbObjNames <- function(x){ gd <- keyFrame(x) unique(c(gd[,1],gd[,2])) } .getDbObjs <- function(x){ dbs <- .getDbObjNames(x) setNames(lapply(dbs, .makeReal, x=x), dbs) } ## Show method (I am not really sure what to put here) setMethod("show", "MultiDb", function(object) { cat("class:", class(object), "\n") cat("Annotation resources:\n") objs <- .getDbObjNames(object) show(objs) cat("Annotation resource relationships:\n") kf <- keyFrame(object) show(kf) cat("Listed resources should each have their own show methods.\n") } ) ############################################################## ## Better show method ## library(Homo.sapiens);Homo.sapiens ## helpers for displaying kinds of objects .showGODb <- function(obj,name){ meta <- metadata(obj) cat('# Includes GODb Object: ', name,'\n') cat('# With data about: ', meta[meta$name=='GOSOURCENAME',2],'\n') ## cat('GO data last updated: ', meta[meta$name=='GOSOURCEDATE',2],'\n') ## cat('\n') } .showOrgDb <- function(obj,name){ meta <- metadata(obj) cat('# Includes OrgDb Object: ', name,'\n') cat('# Gene data about: ', meta[meta$name=='ORGANISM',2],'\n') cat('# Taxonomy Id: ', meta[meta$name=='TAXID',2],'\n') ## cat(': ', meta[meta$name=='GOSOURCEDATE',2],'\n') ## cat(': ', meta[meta$name=='GOSOURCEDATE',2],'\n') ## cat(': ', meta[meta$name=='GOSOURCEDATE',2],'\n') ## cat('\n') } .showTxDb <- function(obj,name){ meta <- metadata(obj) cat('# Includes TxDb Object: ', name,'\n') cat('# Transcriptome data about: ', meta[meta$name=='Organism',2],'\n') cat('# Based on genome: ', meta[meta$name=='Genome',2],'\n') ## cat(': ', meta[meta$name=='GOSOURCEDATE',2],'\n') ## cat('\n') } ## helper for choosing display for correct subObjects .showGeneralSubObject <- function(obj, name){ cls <- class(obj) switch(cls, 'TxDb'=.showTxDb(obj,name), 
'OrgDb'=.showOrgDb(obj,name), 'GODb'=.showGODb(obj,name)) } .getOrgDbByClass <- function(objs){ objs[lapply(objs, class) %in% 'OrgDb'] } .getTxDbByClass <- function(objs){ objs[lapply(objs, class) %in% 'TxDb'] } .getKeyRowWithOrgDbAndTxDb <- function(objs, odb){ orgDbName <- names(.getOrgDbByClass(objs)) txDbName <- names(.getTxDbByClass(objs)) fKeys <- odb@keys orgIdx <- grepl(orgDbName, fKeys[,1]) | grepl(orgDbName, fKeys[,2]) txIdx <- grepl(txDbName, fKeys[,1]) | grepl(txDbName, fKeys[,2]) idx <- orgIdx & txIdx fKeys[idx,] } .showOrganismDbSpecificItems <- function(objs, odb){ ## extract foreign gene keys for OrgDb and TxDb: keyRow <- .getKeyRowWithOrgDbAndTxDb(objs, odb) ## match names with keys like this: orgDbName <- names(.getOrgDbByClass(objs)) if(grep(orgDbName, keyRow)==1){ orgKey <- keyRow[c(1,3)] txKey <- keyRow[c(2,4)] }else if(grep(orgDbName, keyRow)==2){ orgKey <- keyRow[c(2,4)] txKey <- keyRow[c(1,3)] } cat('# The OrgDb gene id',orgKey[2],'is mapped to the TxDb gene id', txKey[2],'.\n') } ## This could become the method for MultiDb objects too. ## But in that case I might want more general elements. ## A better plan is probably to just reuse some of these helpers in ## the other method (when you can) setMethod("show", "OrganismDb", function(object) { cat(class(object),'Object:\n') ## 1st get the objects objs <- .getDbObjs(object) ## loop along and switch the display based on the class mapply(.showGeneralSubObject, objs, names(objs)) ## This is the part that is truly OrganismDb specific (right now) .showOrganismDbSpecificItems(objs, odb=object) } ) ## For people who need ALL the metadata: setMethod("metadata", "MultiDb", function(x) { objs <- .getDbObjs(x) res <- data.frame() for(i in seq_len(length(objs))){ meta <- metadata(objs[[i]]) res <- rbind(meta,res) ## inneficient (but i is normally <= 4) } ## Then get rid of any rows that are repeated d <- duplicated(res[,1]) dupNames <- unique(res[d,1]) res <- res[!(res[,1] %in% dupNames),] message("Only unique values are returned by this metadata method. For individual metadata values that may share a key, such as the Db type, be sure to call metadata on the individual objects. \n") show(res) } ) ## resources returns the contents of the resources slot (simple accessor) .resources <- function(x){ x@resources } setMethod("resources", "MultiDb", function(x){.resources(x)}) ## this is for MultiDb (I should be able to save everything) ## And I just want a saveDb and loadDb methods as I am ## Specifically: they should both have readable ## databases in hand for all resources (and error if not) and they ## should save the object (not the data it depends on) while making ## sure that the database info is present (if needed). ## ## saveDb/loadDb methods which checks to the constructor etc. so that ## we always build the object and save it correctly. Martin proposed ## that I tell users to do TxDb(odb) <- saveDb(TxDb, ## file='TxDb.sqlite'). Martins idea kind of already happens (but ## without the friendly check/warnings if you call save() on the ## object... ## ## And Herve proposed this (which I think is even more user friendly ## (I suspect martin will agree): saveDb(odb, file=), where 'dir' ## is the name of a directory where all the relevant sqlite files will ## be saved alongside of an object.Rda file (which will contain a ## MultiDb with corrected paths). This object will then get found by ## loadDb and allow absolute saving no matter what... 
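## A small illustration of the accessors defined above (a sketch only,
## assuming the Homo.sapiens package from 'Suggests' is installed):
## library(Homo.sapiens)
## resources(Homo.sapiens)  ## @resources slot: file paths, or "" for packages
## metadata(Homo.sapiens)   ## combined, de-duplicated metadata of sub-objects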
## setMethod("saveDb", "MultiDb", function(..., file){ .saveMultDb(x, file)}) ############################################################################### ## Rules for these kinds of packages: ############################################################################### ## 1/2) There must be a select interface implemented. ## 1) You cannot have more than one example of each field that can be retrieved from each type of package that is included. So basically, all values for cols must be unique across ALL supporting packages. You cannot combine resources together if the cols are not unique. So if one package has a cols value that is "FOO", you cannot add any other packages that have a value of "FOO" for their cols. ## 2) You cannot have more than one example of each object type. So you cannot have two org packages (for example). ## 3) You cannot have cycles in the graph. Or maybe you can, but it is a bad idea because it can generate unpredictable results when the algorithm for walking along the tree nodes is used to interpolate nodes. IOW, whenever the algorithm has to traverse a cycle the route it takes will be consistent, but may not be the route that the user intended). ############################################################################## ## Some older stuff that I suspect I can toss out post-refactor: ## ## helpers to get all supporting libs loaded ## ## .initPkg needs to: ## ## 1) see if the package is on the search path? ## ## 2) if not on search path, try to see if it's installed (and load if needed). ## ## 3) emit an appropriate warning in either case. ## ## NOTE: ## ## This function deliberately does not use my .biocAnnPackages() ## ## function because I have no way of knowing if someone else ## ## has made a custom one or is using one from another repos. etc. ## .initPkg <- function(pkg, OrganismDbPkgName, ns=NULL){ ## ## message("pkg is:", pkg) ## ## message("OrganismDbPkgName is:", OrganismDbPkgName) ## if(!exists(pkg)){ ## IOW there is no object ## if(suppressWarnings(!library(pkg, ## character.only = TRUE, ## logical.return=TRUE))){ ## if(!is.null(OrganismDbPkgName)){ ## ## message("The '", pkg, "' pkg is now trying to load from 'inst/extdata/'.") ## pkgPath <- system.file("extdata", paste0(pkg,".sqlite"), ## package=OrganismDbPkgName) ## ## message("SEARCHING this path:", pkgPath) ## msg <- .cantFindResourceMsg(pkg) ## tryCatch(loadDb(pkgPath), ## error = function(e){stop(wmsg(msg))} ) ## } ## } ## } ## } ## ## helper that is just used for those resources that are not separate ## ## packages, but which need to have loadDb called (and be sealed into ## ## the namespace for the OrganismDb object) ## .addLocalPkgsToNamespace <- function(pkgname, graphData, ns){ ## pkgs <- unique(names(.extractPkgsAndCols(graphData))) ## xsts <- sapply(pkgs, exists) ## pkgs <- pkgs[!xsts] ## ## then get the ones that we don't already have and seal to the namespace. ## for(pkg in pkgs){ ## msg <- .cantFindResourceMsg(pkg) ## pkgPath <- system.file("extdata", paste0(pkg,".sqlite"), ## package=pkgname) ## tryCatch({assign(eval(pkg), loadDb(pkgPath))}, ## error = function(e){stop(wmsg(msg))} ) ## assign(pkg, get(eval(pkg)), envir=ns) ## namespaceExport(ns, pkg) ## } ## } ## ## TODO: namespace looks like it doesn't need to be passed around (just referrered to be name with asNamespace()) ## ## So instead of passing that around, just get it (as needed) and add things to it in each place. 
## ## lose the list (pkgVals) ## ## And get sorted why I have this bug with not being able to export an OrgDb (when I was doing that just a little bit ago)... OrganismDbi/R/AllGenerics.R0000644000175200017520000000116314136050033016502 0ustar00biocbuildbiocbuildsetGeneric("getTxDbIfAvailable", function(x, ...) standardGeneric("getTxDbIfAvailable")) ## getter/setter set setGeneric("TxDb",function(x, ...) standardGeneric("TxDb")) setGeneric("TxDb<-",signature="x",function(x, value) standardGeneric("TxDb<-")) setGeneric("selectByRanges", signature="x", function(x, ranges, columns, overlaps, ignore.strand) standardGeneric("selectByRanges")) setGeneric("selectRangesById", signature="x", function(x, keys, columns, keytype, feature) standardGeneric("selectRangesById")) setGeneric("resources", function(x) standardGeneric("resources")) OrganismDbi/R/createOrganismPackage.R0000644000175200017520000005031714136050033020536 0ustar00biocbuildbiocbuild## New helper to lookup which org object or package should be used ## based on the taxonomy ID. It takes a tax ID and returns an appropriate OrgDb object. .packageTaxIds <- function(){ c('180454'='org.Ag.eg.db', '3702'='org.At.tair.db', '9913'='org.Bt.eg.db', '9615'='org.Cf.eg.db', '9031'='org.Gg.eg.db', '9598'='org.Pt.eg.db', '511145'='org.EcK12.eg.db', '386585'='org.EcSakai.eg.db', '7227'='org.Dm.eg.db', '9606'='org.Hs.eg.db', '10090'='org.Mm.eg.db', '9823'='org.Ss.eg.db', '10116'='org.Rn.eg.db', '9544'='org.Mmu.eg.db', '6239'='org.Ce.eg.db', '8355'='org.Xl.eg.db', '559292'='org.Sc.sgd.db', '7955'='org.Dr.eg.db', '36329'='org.Pf.plasmo.db') } .taxIdToOrgDb <- function(taxid) { ## These are the packaged TaxIds packageTaxIds <- .packageTaxIds() if (taxid %in% names(packageTaxIds)) { pkg <- packageTaxIds[names(packageTaxIds) %in% taxid] nmspc <- loadNamespace(pkg) res <- get(pkg, nmspc) } else { ## If we don't have a package, then lets get the taxIds and AHIds ## for the hub objects loadNamespace("AnnotationHub") ah <- AnnotationHub::AnnotationHub() ah <- subset(ah, ah$rdataclass=='OrgDb') mc <- mcols(ah)[,'taxonomyid', drop=FALSE] ## Then just get the object AHID <- rownames(mc[mc$taxonomyid==taxid,,drop=FALSE]) if (!length(AHID)) stop("no OrgDb package found for taxid ", taxid) else res <- ah[[AHID]] } res } ## examples: ## .taxIdToOrgDb(9606) ## .taxIdToOrgDb(9986) ## Need another helper to get us from taxID to the OrgDbName... 
.taxIdToOrgDbName <- function(taxid) { packageTaxIds <- .packageTaxIds() if (taxid %in% names(packageTaxIds)) { pkg <- packageTaxIds[names(packageTaxIds) %in% taxid] nmspc <- loadNamespace(pkg) path <- dbfile(get(pkg, nmspc)) res <- sub("sqlite$", "db", basename(path)) } else { ## If we don't have a package, then lets get the taxIds and AHIds ## for the hub objects loadNamespace("AnnotationHub") ah <- AnnotationHub::AnnotationHub() ah <- subset(ah, ah$rdataclass=='OrgDb') mc <- mcols(ah)[,c('taxonomyid','title'), drop=FALSE] ## Then just get the object data <- mc[mc$taxonomyid==taxid,,drop=FALSE] res <- sub("sqlite","db", data$title) } res } ## examples ## .taxIdToOrgDbName(9606) ## .taxIdToOrgDbName(9986) ############################################################################# ## simplify DB retrievals from metadata table .getMetaDataValue <- function(db, name){ con <- AnnotationDbi::dbconn(db) res <- dbGetQuery(con, paste0("SELECT value FROM metadata WHERE name='", name,"'"))[[1]] if(!is.character(res)) stop("missing metadata table value for: ", name) res } ## function to allow us to convert a list into the inernally preferred form... .mungeGraphData <- function(graphData){ ##pkgs <- sapply(graphData, names) ## This is no good. :( pkgs <- matrix(unlist(lapply(graphData, names)), ncol=2, byrow=TRUE) keys <- matrix(unlist(graphData), ncol=2, byrow=TRUE) graphData <- cbind(pkgs, keys) colnames(graphData) <- c("xDbs","yDbs","xKeys","yKeys") graphData } ## early sanity checks for graphData .testGraphData <- function(graphData){ if(ncol(graphData) !=4L) stop("'graphData' must contain exactly 4 columns.") } ## test keys for graphData BEFORE we make any objects (which just ## means that we are not going to use data from the @resources slot ## for this function) .testKeys <- function(fkeys){ pkgs <- unlist(lapply(names(fkeys), get)) res <- logical(length(pkgs)) for(i in seq_len(length(pkgs))){ res[i] <- fkeys[i] %in% columns(pkgs[[i]]) } if(!all(res)) stop("some foreign keys are not present in their associated databases") } ## helper to list bioc Annot packages (for filling in things like ## suggests fields) .biocAnnPackages <- function(){ availAnns <- as.data.frame(available.packages( contrib.url(repositories()[["BioCann"]], "source"))) as.character(availAnns[["Package"]]) } ## Helper to set up to just load packages that need loading. .extractDbFiles <- function(gd, deps){ pkgs <- unique(names(.extractPkgsAndCols(gd))) ## Before we can proceed, we may need to call library on the deps... lapply(deps, library, character.only = TRUE) files <- unlist(lapply(pkgs, function(x){dbfile(get(x))})) setNames(files, pkgs) } ## We want makeOrganismPackage to be self contained (have all it needs) ## IOW we want to store .sqlite files in a local inst/extdata when ## they are not known packages. makeOrganismPackage <- function(pkgname, graphData, organism, version, maintainer, author, destDir, license="Artistic-2.0"){ ## there should only be one template template_path <- system.file("OrgPkg-template",package="OrganismDbi") ## We need to get a list of dependencies: ## 1st convert graphData into a data.frame gd <- .mungeGraphData(graphData) allDeps <- unique(as.vector(gd[,1:2])) ## Filter dependencies to make sure they are really package names biocPkgNames <- .biocAnnPackages() deps <- allDeps[allDeps %in% biocPkgNames] depsStr <- paste(deps,collapse=", ") ## We need to define some symbols in order to have the ## template filled out correctly. 
symvals <- list( PKGTITLE=paste("Annotation package for the",pkgname,"object"), PKGDESCRIPTION=paste("Contains the",pkgname,"object", "to access data from several related annotation packages."), PKGVERSION=version, AUTHOR=author, MAINTAINER=maintainer, LIC=license, ORGANISM=organism, ORGANISMBIOCVIEW=gsub(" ","_",organism), PKGNAME=pkgname, DEPENDENCIES=depsStr ) ## Check the graphData object and rename if needed .testGraphData(gd) ## Try to call require on all the supporting packages. ## pkgs <- unique(names(.extractPkgsAndCols(gd))) ## for (pkg in pkgs) ## .initPkg(pkg) ## ######################################################################### ## Extract the dbFile information from each object and store that ## into resources ## EXCEPT Don't: (I can't really do this for 'packages' as the data ## is system specific) ## TODO: change this so that it isn't getting a bunch of dbFiles ## and then throwing them aways (or so that it's optional with a ## parameter or whatever seems appropriate for this function) resources <- .extractDbFiles(gd, deps) resources <- unlist(lapply(resources, function(x){return("")})) ## ######################################################################### ## Also check that the fkeys are really columns for the graphData fkeys <- .extractPkgsAndCols(gd) .testKeys(fkeys) ## Should never have duplicates if (any(duplicated(names(symvals)))) stop("'symvals' contains duplicated symbols") ## All symvals should by single strings (non-NA) is_OK <- sapply(symvals, isSingleString) if (!all(is_OK)) { bad_syms <- paste(names(is_OK)[!is_OK], collapse="', '") stop("values for symbols '", bad_syms, "' are not single strings") } createPackage(pkgname=pkgname, destinationDir=destDir, originDir=template_path, symbolValues=symvals) ## Now just do work to save the data.frame (pass that in instead of the ## other stuff) in /data as a serialized R file. ## There will already be a /data dir in the template ## So just save to it: graphData <- gd graphInfo <- list(graphData=graphData, resources=resources) ## create data dir (because R CMD build removes empty dirs) ## And then save the data there. dir.create(file.path(destDir,pkgname,"data")) save(graphInfo, file=file.path(destDir,pkgname,"data","graphInfo.rda")) ## Get and other things that need to be saved and stash them into ## /inst/extdata otherDeps <- allDeps[!allDeps %in% biocPkgNames] .saveFromStr <- function(x, file){saveDb(x=get(x), file=file)} if(length(otherDeps)>0){ ## Then we have to save stuff locally extPath <- file.path(destDir,pkgname,"inst","extdata") dir.create(extPath, recursive=TRUE) mapply(.saveFromStr, x=otherDeps, file=file.path(extPath, paste0(otherDeps,".sqlite"))) } } ################################################################################ ## Now for some create functions that are more specialized: ## To create OrgansimDb objects from UCSC or from biomaRt ## the initial versions of these will just create the object (and not ## a package) ## Also: there are some pre-agreed upon expectations for this ## function. It will make a specific kind of OrganismDb object. One ## that contains certain specific elements (GO, OrgDb and TxDb - to ## start). And the expectation is that we will always have those ## things when this function finishes. So the contract is less ## general than before. ######## ## Helper to set up to just load packages that need loading. BUT this ## version of this function is less agressive and doesn't try to load ## a file if it already exists. 
This is a different behavior than we ## want for packaging where things should be more strict. .gentlyExtractDbFiles <- function(gd, deps){ pkgs <- unique(names(.extractPkgsAndCols(gd))) ## Before we can proceed, we may need to call library on the deps... .library <- function(dep){ if(!exists(dep)){ library(dep, character.only = TRUE) } } lapply(deps, .library) files <- unlist(lapply(pkgs, function(x){dbfile(get(x))})) setNames(files, pkgs) } ## from TxDb ## There are some issues with this as it's currently implemented: ## Because its not using the txdb that is passed in but is instead ## making one up and then assigning it to the global namespace. This ## is what we want for makeOrgansimDbFromXXX when XXX is UCSC or ## BiomaRt, but its *not* what we want for a simple exsting txdb. ## It's straightforward to do something different when the txdb ## exists(). But: how to I look up it's original name in the ## .GlobalEnv ??? I need to know it's name for the graphData list ## below. ## The other big problem is that if my TxDb objects don't have a ## dbfile() then they can't be saved and re-loaded later. But this ## problem is not "new" for this object. ## Still TODO: 1) either put the assigned functions in a special env that is accesible to these objects OR else make a subclass that can hold those objects. ## 2) make a save method that complains if any of the objects has not had saveDb called on it before calling a constructor. And also add a warning to the constructor when somethings dbfile() is an empty string. ### Private environment for storing TxDb objects as needed (failed strategy) ## To work I would have to make the env public. - It was never a ## great idea fo r this use case. ## TxDbObjs <- new.env(hash=TRUE, parent=emptyenv()) ## I may still want to go with the other option of stashing this data ## into a subclass... For one thing, that option can't have name ## clashes... ## Also: the name this local TxDb gets assigned to cannot be the same as is used by a package. Otherwise a shortened 'custom' TxDb can be overwritten by a name clash with a package name... This could end up being true even if I store the TxDb locally inside of a named sub-class. ## Also also: the name should not be made 'special' in the case where makeOrganismDbFromTxDb is called as a helper function from within makeOrganismDbFromUCSC or makeOrganismDbFromBiomart. makeOrganismDbFromTxDb <- function(txdb, keytype=NA, orgdb=NA){ if (!is(txdb, "TxDb")) stop("'txdb' must be a TxDb object") if(!is(orgdb, "OrgDb") && !is.na(orgdb)) stop("'orgdb' must be an OrgDb object or NA") if (!isSingleStringOrNA(keytype)) stop("'keytype' must be a single string or NA") ## Then assign that object value to the appropriate name: txdbName <- makePackageName(txdb) ## We temp assign to global scope ## (b/c you need it there if you 'generated' it) ## After we can remove it? 
(will be stored in the object) assign(txdbName, txdb, .GlobalEnv) ## Then get the tax ID: taxId <- taxonomyId(txdb) ## Then get the name and valued for the OrgDb object if (!is(orgdb, "OrgDb") && is.na(orgdb)){ orgdbName <- OrganismDbi:::.taxIdToOrgDbName(taxId) orgdb <- OrganismDbi:::.taxIdToOrgDb(taxId) assign(orgdbName, orgdb, .GlobalEnv) }else{ org <- metadata(orgdb)[metadata(orgdb)$name=='ORGANISM',2] org <- sub(" ", "_", org) orgdbName <- paste0('org.',org,'.db') orgdb <- orgdb assign(orgdbName, orgdb, .GlobalEnv) } ## get the primary key for the OrgDb object: if(is.na(keytype)){ geneKeyType <- chooseCentralOrgPkgSymbol(orgdb) }else{ geneKeyType <- keytype } graphData <- list(join1 = setNames(object=c('GOID', 'GO'), nm=c('GO.db', orgdbName)), join2 = setNames(object=c(geneKeyType, 'GENEID'), nm=c(orgdbName, txdbName))) ## get the organism organism <- organism(txdb) ############################################################# ## Process and then test the graph Data gd <- OrganismDbi:::.mungeGraphData(graphData) OrganismDbi:::.testGraphData(gd) allDeps <- unique(as.vector(gd[,1:2])) biocPkgNames <- OrganismDbi:::.biocAnnPackages() deps <- allDeps[allDeps %in% biocPkgNames] resources <- OrganismDbi:::.gentlyExtractDbFiles(gd, deps) ## Check that the fkeys are really columns for the graphData fkeys <- OrganismDbi:::.extractPkgsAndCols(gd) OrganismDbi:::.testKeys(fkeys) ## Then make the object: graphInfo <- list(graphData=gd, resources=resources) OrganismDbi:::OrganismDb(graphInfo=graphInfo) } ## from UCSC makeOrganismDbFromUCSC <- function(genome="hg19", tablename="knownGene", transcript_ids=NULL, circ_seqs=NULL, url="http://genome.ucsc.edu/cgi-bin/", goldenPath.url=getOption("UCSC.goldenPath.url"), miRBaseBuild=NA){ ## So call the function to make that TxDb txdb <- makeTxDbFromUCSC(genome=genome, tablename=tablename, transcript_ids=transcript_ids, circ_seqs=circ_seqs, url=url, goldenPath.url=goldenPath.url, miRBaseBuild=miRBaseBuild) makeOrganismDbFromTxDb(txdb) } makeOrganismDbFromBiomart <- function(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", transcript_ids=NULL, circ_seqs=NULL, filter="", id_prefix="ensembl_", host="www.ensembl.org", port=80, miRBaseBuild=NA, keytype="ENSEMBL", orgdb = NA){ ## So call the function to make that TxDb txdb <- makeTxDbFromBiomart(biomart=biomart, dataset=dataset, transcript_ids=transcript_ids, circ_seqs=circ_seqs, filter=filter, id_prefix=id_prefix, host=host, port=port, miRBaseBuild=miRBaseBuild) makeOrganismDbFromTxDb(txdb, keytype=keytype, orgdb=orgdb) } ## PROBLEM: OrganismDbi:::.extractDbFiles(gd, deps) requires (strictly) that all objects be available as files somewhere (no exceptions allowed) ## This means that when I get to this stage, with biomaRt, it fails because there is not a TxDb on disc... 
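## Usage sketch for the constructors above (not run; assumes the
## TxDb.Hsapiens.UCSC.hg19.knownGene package -- or network access for the
## UCSC/biomaRt variants -- plus GO.db and org.Hs.eg.db are available):
## library(TxDb.Hsapiens.UCSC.hg19.knownGene)
## odb <- makeOrganismDbFromTxDb(TxDb.Hsapiens.UCSC.hg19.knownGene)
## columns(odb)
## select(odb, keys="1", columns=c("SYMBOL","TXNAME"), keytype="ENTREZID")
## ## or build the TxDb on the fly from UCSC:
## odb2 <- makeOrganismDbFromUCSC(genome="hg19", tablename="knownGene")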
################################################################################ ################################################################################ .getMaxEns <- function(srcUrl){ release <- sub("^ftp://ftp.ensembl.org/pub/release-","",srcUrl) release <- unique(sub("/gtf/.*gtf.gz$","", release)) max(release) } ## I need a function that will list the GTFs that end users can use to ## make into TxDbs (will probably not overlap perfectly with available ## OrgDbs) available.GTFsForTxDbs <- function() { loadNamespace("AnnotationHub") ah <- AnnotationHub::AnnotationHub() ## get OrgDb species aho <- subset(ah, ah$rdataclass=='OrgDb') oTaxids <-unique(aho$taxonomyid) ## get GTF species from ensembl ahg <- subset(ah, grepl('gtf.gz$',ah$sourceurl)) ahg <- subset(ahg, ahg$dataprovider=='Ensembl') max <- .getMaxEns(ahg$sourceurl) maxStr <- paste0("ftp://ftp.ensembl.org/pub/release-",max) ahg <- subset(ahg, grepl(maxStr,ahg$sourceurl)) gTaxids <-unique(ahg$taxonomyid) ## intersect of taxIds taxInt <- intersect(oTaxids, gTaxids) ## subset down to just the ahgs that can work... ahg <- subset(ahg, ahg$taxonomyid %in% taxInt) ## for now return the subsetted annotationHubObject ahg } ## And I need a function that will make the transformation for them makeHubGTFIntoTxDb <- function(ahg){ if(length(ahg) > 1){ stop('This function expects only one hub object at a time.')} ## get the available GTFs. ahgs <- available.GTFsForTxDbs() ## Is this one of those? If so, then make it happen if(names(ahg) %in% names(ahgs)){ txMeta <- data.frame(name='Data source', value='Ensembl GTF') txdb <- makeTxDbFromGRanges(ahg[[1]], metadata= txMeta, taxonomyId=ahg$taxonomyid) require(OrganismDbi) ## requires using the 'ENSEMBL' keytype (for these TxDbs) ## odb <- makeOrganismDbFromTxDb(txdb, keytype='ENSEMBL') odb <- makeOrganismDbFromTxDb(txdb) }else{ stop('No OrgDb information for ', ahg$species) } odb } ## testing: ## ahgs <- OrganismDbi:::available.GTFsForTxDbs() ## ahg <- ahgs[1] ## odb <- OrganismDbi:::makeHubGTFIntoTxDb(ahg) ## debug(OrganismDbi:::makeHubGTFIntoTxDb) ## debug(OrganismDbi:::makeOrganismDbFromTxDb) ## debug(OrganismDbi:::.gentlyExtractDbFiles) ## 1st bad problem is that it basically can't find this: ## org.Ailuropoda_melanoleuca.eg.db. This is bad since it is one more ## thing that I have to put into the global namespace (temp hack). ## This will need to be done a different way! Should I use a special ## package environment? Will it help if I put a slot in that can hold ## an OrdDb? Or should I find a way to make the resources vector more ## generic by having it only apply to hub items (IOW combine a generic ## character vector with hubcache information). - It is clear that the ## resources vector cannot work for all kinds of paths and situations ## though. Since people can put objects ANYWHERE. ## 2nd bad problem (even worse if you can believe it), is that these ## orgDbs don't have ensembl IDs in them. This is bad because that ## basically means that none of these organisms can be supported for ## these TxDbs (which will use ensembl based gene identifiers. I can ## update the OrgDbs, it's just a crucial missing piece that has to be ## retrieved before I can proceed. ## 3rd problem this uncovered is that I had to use AH cars for the ## object just to get easy access to the metadata. It seems like we ## could benefit from a universal application of that metadata to the ## contents that come out of these cars... Then it would be ## straightforward to write a method like I wanted to here. 
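## A sketch of how makeOrganismPackage() is expected to be called (not run;
## the graphData list mirrors the join1/join2 structure used in
## makeOrganismDbFromTxDb() above, and the package name, author and
## maintainer strings below are placeholders, not real values):
## gd <- list(join1 = c(GO.db="GOID", org.Hs.eg.db="GO"),
##            join2 = c(org.Hs.eg.db="ENTREZID",
##                      TxDb.Hsapiens.UCSC.hg19.knownGene="GENEID"))
## makeOrganismPackage(pkgname="Homo.sapiens.custom",
##                     graphData=gd,
##                     organism="Homo sapiens",
##                     version="1.0.0",
##                     maintainer="Some Maintainer <maintainer@someplace.org>",
##                     author="Some Author",
##                     destDir=tempdir(),
##                     license="Artistic-2.0")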
OrganismDbi/R/methods-select.R0000644000175200017520000006763114136050033017246 0ustar00biocbuildbiocbuild# This will just hold code for the initial implementation of select and friends ## helper to convert text strings (Db pkgs names) into real objects ## x is an OrgDb object, and str is the name we want made into an object... .makeReal <- function(x, str){ resource <- x@resources[names(x@resources) %in% str] if (length(resource)==1) { if (resource != "") { res <- loadDb(resource) } else { ## otherwise use the local name res <- tryCatch({ get(str, envir = asNamespace(str)) }, error = function(err) { ## original implementation, but fails if 'str' (e.g., ## 'GO.db') is from a package not attached to the ## search path. Are all 'str' always from packages, ## with names following the convention str::str? We'll ## leave this as a fall-back for user-defined objects ## 'str' available only on the search path. get(str) }) } } else { stop("object does not contain resource named '", str, "'") } res } ## Standard methods: .keytypes <- function(x){ dbs <- .getDbObjs(x) unique(unlist(lapply(dbs, keytypes))) } setMethod("keytypes", "MultiDb", .keytypes) ## Usage: ## keytypes(Homo.sapiens) .cols <- function(x){ dbs <- .getDbObjs(x) unique(unlist(lapply(dbs, columns))) } setMethod("columns", "MultiDb", function(x){.cols(x)}) ## Usage: ## columns(Homo.sapiens) ## Strategy for keys: I need a lookup function that can 1) generate the keys ## for each slot and then lookup which slot I should be tapping based on a ## keytype. 2) This needs to be general purpose (will be needed again in ## select) and 3) it also may need to be able to return multiple hits in the ## event that there are eventually multiple IDs named the same way (depends on ## whether or not we allow repeat ID names). I think we WILL want to allow ## this, which means I will have to do some kind of name-spacing scheme. .makekeytypeMapping <- function(x){ objs <- .getDbObjs(x) unlist2(lapply(objs, keytypes)) } .lookupDbNameFromKeytype <- function(x, keytype){ res <- .makekeytypeMapping(x) ## no duplicates so I can just return the name names(res)[res %in% keytype] } .lookupDbFromKeytype <- function(x, keytype){ .makeReal(x, str=.lookupDbNameFromKeytype(x, keytype)) } .keys <- function(x, keytype, ...){ testForValidKeytype(x, keytype) db <- .lookupDbFromKeytype(x, keytype) ## And then we can just call keys... as.character(keys(db, keytype, ...)) } setMethod("keys", "MultiDb", .keys) ## Usage: ## head(keys(Homo.sapiens, keytype="PMID")) ## the use case for GOID will present a special challenge... ## head(keys(Homo.sapiens, keytype="GOID")) ## This method just gets me the pkg names as names and vals are fkeys .getDbNameFKeys <- function(x){ gd <- keyFrame(x) ## now give all the keys as a vector, but named by their databases. .extractPkgsAndCols(gd) } ## .mkeys will return appropriate value "on the fly" based on the ## contents of keyFrame(). It will take at least three arguments: the two ## tables plus an indicator for which of the two keys 1st or 2nd table key is ## needed. 
## tbl1,tbl2 wil be actual package names like 'org.Hs.eg.db' or 'GO.db' ## tbl1 = "TxDb.Hsapiens.UCSC.hg19.knownGene" ## tbl2 = "org.Hs.eg.db" ## key = "tbl1" .parseCol <- function(piece, str) grepl(str, piece) .mkeys <- function(x, tbl1, tbl2, key=c("tbl1","tbl2", "both")){ if(length(tbl1) != 1L || length(tbl2) != 1L) stop("specify only one pair of tables at a time") key <- match.arg(key) kf <- keyFrame(x) ## process for a double match of tbl1 and tbl2 (in any order) ## note: (we should ALWAYS have one when this function is called) res <- apply(kf[,1:2], MARGIN=2, FUN=.parseCol, tbl1) res2 <- apply(kf[,1:2], MARGIN=2, FUN=.parseCol, tbl2) fin <- res | res2 resRowIdx <- fin[,1] & fin[,2] matchRow <- kf[resRowIdx,] if(length(matchRow) == 0L) stop("no relationship found for ",tbl1," and ",tbl2) ## now the tricky part is that in returning the keys I have to get the ## correct keys back to the user... And this is based on whether tbl1 was ## one thing or another. if(length(matchRow[["xDbs"]]) >1L) stop("failed to limit choices to 1") if(key=="tbl1"){ if(grepl(tbl1,matchRow[["xDbs"]])){ ans <- as.character(matchRow[["xKeys"]]) }else{ ## then its reversed of the order in the row... ans <- as.character(matchRow[["yKeys"]]) } }else if(key=="tbl2"){ if(grepl(tbl2,matchRow[["yDbs"]])){ ans <- as.character(matchRow[["yKeys"]]) }else{ ## and the reverse case ans <- as.character(matchRow[["xKeys"]]) } }else if(key=="both"){ ans <- c(as.character(matchRow[["xKeys"]]), as.character(matchRow[["yKeys"]])) names(ans) <- c(as.character(matchRow[["xDbs"]]), as.character(matchRow[["yDbs"]])) ## When we say "both" we still want keys returned in same order as ## original packages. IOW, if tbl1 goes with key 1, then we should list ## key 1 1st in the result... ans <- ans[match(c(tbl1,tbl2),names(ans))] } ans } ## helper for getting all cols by all nodes .colsByNodes <- function(x){ gr <- dbGraph(x) allCols <- lapply(nodes(gr), function(elt, x){columns(.makeReal(x, elt))}, x=x) names(allCols) <- nodes(gr) allCols } ## library(Homo.sapiens) ## library(RBGL) ## library(graph)a ## x = Homo.sapiens ## allCols <- .colsByNodes(x) ## helper to get the subgraph .getRelevantSubgraph <- function(x, cols, keys, keytype){ gr <- dbGraph(x) allCols <- .colsByNodes(x) inSubgraph <- sapply(allCols, function(cols, keys) any(keys %in% cols), union(keytype, cols)) subGraph(names(inSubgraph)[inSubgraph], gr) } ## kt <- "ENTREZID" ## cls = c("GOID" , "SYMBOL", "TXNAME") ## keys <- head(keys(x, "ENTREZID")) ## subgr <- .getRelevantSubgraph(x, cols=cls, keys, keytype=kt) ## We will also need the root ## root = OrganismDbi:::.lookupDbNameFromKeytype(x, kt) ## I think this is meant to be an lapply .getForeignKeys <- function(x, subgr){ fKeys <- lapply(strsplit(edgeNames(subgr), "~"), function(tables, x, key) .mkeys(x, tables[[1]], tables[[2]], "both"), x) unlist(fKeys, use.names=FALSE) } ## fKeys <- .getForeignKeys(x, subgr) ## now combine all the keys together ## selectCols = unique(c(kt, fKeys, cls)) ## sort the needed cols by their nodes .getColsByNodes <- function(subgr, selectCols, allCols){ lapply(allCols[nodes(subgr)], function(col, selectCols) col[col %in% selectCols], selectCols) } ## needCols <- .getColsByNodes(subgr, selectCols, allCols) ## get list of nodes to visit .bfs <- function(object, node) ## names are bfs order; values are 'from' nodes { bfs <- bfs(object, node) from <- sapply(edges(object)[bfs], function(table, x) { x[which.max(x %in% table)] }, bfs) from[1] <- NA from } ## So our visitNodes then becomes: ## 
visitNodes = .bfs(subgr, root) ## new version of .getSelects() ## ## select .getSelects <- function(x, keytype, keys, needCols, visitNodes){ ## set up an empty list with names that match what we want to fill... selected <- setNames( vector("list", length(visitNodes)), names(visitNodes)) ## in 1st case we only need the name node1 <- names(visitNodes)[[1]] suppressMessages( selected[[node1]] <- select(.makeReal(x, node1), keys=as.character(keys), columns=needCols[[node1]], keytype=keytype, skipValidKeysTest=TRUE) ) ## but here we need to use the name and the value of visitNodes otherNodes <- visitNodes[-1] for (i in seq_len(length(otherNodes))) { nodeName <- names(otherNodes)[i] fromNode <- otherNodes[i] fromKey <- .mkeys(x, fromNode, nodeName, "tbl1") fromKeys <- unique(selected[[fromNode]][[fromKey]]) ## fromKeys <- fromKeys[!is.na(fromKeys)] toKey <- .mkeys(x, fromNode, nodeName, "tbl2") suppressMessages( selected[[nodeName]] <- select(.makeReal(x, nodeName), keys=as.character(fromKeys), columns=needCols[[nodeName]], keytype=toKey, skipValidKeysTest=TRUE) ) ## We don't do the validity test for keys for at the lower level ## because that doesn't make sense. } selected } ## selected <- .getSelect(kt,keys,needCols, visitNodes) ## new version of .mergeSelectResults ## merge .mergeSelectResults <- function(x, selected, visitNodes, oriCols){ final <- selected[[1]] otherNodes <- visitNodes[-1] for (i in seq_len(length(otherNodes))) { nodeName <- names(otherNodes)[i] fromNode <- otherNodes[i] fromKey <- .mkeys(x, fromNode, nodeName, "tbl1") toKey <- .mkeys(x, fromNode, nodeName, "tbl2") final <- merge(final, selected[[nodeName]], by.x=fromKey, by.y=toKey, all=TRUE) ## recover the col that is lost from the merge ## (header is sometimes needed) lostKeys <- data.frame(toKey=final[[fromKey]]) colnames(lostKeys) <- toKey final <- cbind(final, lostKeys) ## bind b.c lostKeys is post-merge clone } final } ## res <- .mergeSelectResults(selected, visitNodes) ## helper to get fks when it's needed. This returns NULL when not ## appropriate and a special 'compound' list of keys whenever it is. ## .getFksWhenAppropriate <- function(x, keytype){ ## forgnKeys<-as.data.frame(x@keys,stringsAsFactors=FALSE)[,c('xKeys','yKeys')] ## keysIdx<-grepl(keytype, forgnKeys) ## if(any(keysIdx)){ ## getKeys <- as.character(forgnKeys[keysIdx,]) ## will always be one row ## res <- unique(c(keys(x, keytype=getKeys[1]), ## keys(x, keytype=getKeys[2]))) ## }else{ res <- NULL} ## res ## } .hasSynonymousKeys <- function(x, keytype){ forgnKeys<-as.data.frame(x@keys,stringsAsFactors=FALSE)[,c('xKeys','yKeys')] any(grepl(keytype, forgnKeys)) } .select <- function(x, keys, cols, keytype, ...){ ## Argument checking: if(missing(keys)){stop("You must provide a keys argument")} if(missing(cols)){stop("You must provide columns argument")} if(missing(keytype)){stop("You must provide a keytype argument")} ## Some more argument checking skipValidKeysTest <- .hasSynonymousKeys(x, keytype) testSelectArgs(x, keys=keys, cols=cols, keytype=keytype, skipValidKeysTest=skipValidKeysTest) ## if asked for what they have, just return that. if(all(cols %in% keytype) && length(cols)==1L){ res <- data.frame(keys=keys) colnames(res) <- cols return(res) } ## Preserve original cols (we will be adding some to get our results ## along the way oriCols <- cols ## New methods make more use of graph objects. 
allCols <- .colsByNodes(x) subgr <- .getRelevantSubgraph(x, cols=cols, keys, keytype=keytype) root <- .lookupDbNameFromKeytype(x, keytype) fKeys <- .getForeignKeys(x, subgr) selectCols <- unique(c(keytype, fKeys, cols)) needCols <- .getColsByNodes(subgr, selectCols, allCols) visitNodes <- .bfs(subgr, root) selected <- .getSelects(x, keytype,keys,needCols,visitNodes) res <- .mergeSelectResults(x, selected, visitNodes, oriCols) ## Next we need to filter out all columns that we didn't ask for. ## Actually that is not quite right, what we want to do is make a blacklist ## of columns that were added (in fkeys) and that were NOT requested ## (oriCols and keytype). extraKeys <- .getDbNameFKeys(x) blackList <- extraKeys[!(extraKeys %in% unique(c(oriCols, keytype)))] ## if they asked for one of the GO items, then GO is not blacklisted ## if(any(columns(GO.db) %in% oriCols)){ ## blackList <- blackList[!(blackList %in% "GO")] ## } res <- res[,!(colnames(res) %in% blackList), drop=FALSE] ## Then call code to clean up, reorder the rows (and add NA rows as needed). if(nrow(res) > 0L){ res <- resort_base(res, keys, keytype, colnames(res)) } # unique(res) ## NO! We don't want to do this. res } setMethod("select", "MultiDb", function(x, keys, columns, keytype, ...){ ## .selectWarnOrganismDb(x, keys, columns, keytype, ...) .select(x, keys, columns, keytype, ...) } ) ##TODO: .mergeSelectResults is leaving incorrect labels on things: Clean this up! ## methods for easy DB access: .dbconn <- function(x){ dbs <- .getDbObjs(x) res <- unique(unlist(lapply(dbs, dbconn))) names(res) <- names(dbs) res } setMethod("dbconn", "MultiDb", function(x){.dbconn(x)}) .dbfile <- function(x){ dbs <- .getDbObjs(x) res <- unique(unlist(lapply(dbs, dbfile))) names(res) <- names(dbs) res } setMethod("dbfile", "MultiDb", function(x){.dbfile(x)}) ## mapIds ## Standard methods: setMethod("mapIds", "MultiDb", function(x,keys,column,keytype,...,multiVals) mapIds_base(x,keys,column,keytype,...,multiVals=multiVals) ) ## library(Homo.sapiens); debug(OrganismDbi:::.mapIds); ## mapIds(Homo.sapiens, keys=c('1','10'), column='ALIAS', keytype='ENTREZID', multiVals="CharacterList") ## TODO: add some unit tests for this. ## mapIds(Homo.sapiens, c('1','10'), 'GENEID', 'ENTREZID') ## taxonomyId for MultiDb relies on the TxDb object. ## this could be changed to instead check the OrgDb object. ## but that would require adding a new helper "getOrgDbIfAvailable()" .taxonomyId <- function(x){ txdb <- getTxDbIfAvailable(x) taxonomyId(txdb) } setMethod("taxonomyId", "MultiDb", function(x){.taxonomyId(x)}) ######################################################################### ## New method (experimental) to just see if we can make it easier for ## people who have RANGES and then want to see the associated ## annotations. ## Eventually, we may want to let the user choose which annotation ## range accessor should be called to see if their ranges overlap ## (with an 'annotFUN' argument). ## BUT RIGHT NOW: this will just do the simplest possible thing: ## issues: ## 1) exons, transcripts returns redundant results... (I really want ## exonsBy(x, by='gene') and then collapse the result. Unfortunately, this means that the metadata is not in the mcols. ## 2) use transcriptsBy() for 'genes' (more accurate) - (currently they are both offered) ## 3) utrs and introns (similar issues to #1 above) ## 4) for utrs and introns, what you get back is grouped by transcript. 
So I need to be able to re-group the results by gene OR (failing that, just call findOverlaps on *that* and the do: ## 5) And then I also (separately) need to be able to get the columns for these genes by calling select and then compressing that result to a DataFrame that can be put into mcols. .selectByRanges <- function(x, ranges, columns=c('ENTREZID','SYMBOL'), overlaps=c('gene','tx','exon', 'cds', 'intron','5utr','3utr'), ignore.strand=FALSE){ ## Make sure everyone is OK with overlaps as argument name ## overlaps <- match.arg(overlaps, several.ok = TRUE) overlaps <- match.arg(overlaps) subj <- switch(overlaps, gene=genes(x,columns=columns), exon=exonsBy(x,columns=columns,by='gene',outerMcols=TRUE), cds=cdsBy(x,columns=columns,by='gene',outerMcols=TRUE), tx=transcriptsBy(x,columns=columns,by='gene',outerMcols=TRUE), ## the next three all return GRL grouped by transcripts... '5utr'=fiveUTRsByTranscript(x), '3utr'=threeUTRsByTranscript(x), intron=intronsByTranscript(x) ) ## Then get the mcols if(overlaps %in% c('gene','tx','exon', 'cds')){ ## Next do the overlaps hits <- findOverlaps(query=ranges, subject=subj, ignore.strand=ignore.strand) results <- ranges[queryHits(hits)] mcols(results) <- mcols(subj[subjectHits(hits)]) }else{ ## then it's mapped to transcripts so: ## Here we have to get our metadata set up and compressed FIRST keys <- names(subj) ## keys are NOT unique) ## Get basic metadata mapped to TXID meta <- select(x, keys=keys, columns=columns, keytype='TXID') ## Then compress based on the TXID keytype fa <- factor(meta[['TXID']], levels=unique(as.character(keys))) metaC <- .compressMetadata(fa, meta, avoidID='TXID') ## Then attach this compressed data onto the subject mcols(subj) <- metaC ## THEN we can do our overlaps hits <- findOverlaps(query=ranges, subject=subj, ignore.strand=ignore.strand) results <- ranges[queryHits(hits)] mcols(results) <- mcols(subj[subjectHits(hits)]) ## because we mapped by TXIDs, we have to remove redundant results dfRes <- as(results,'data.frame') uniqueIdx = !duplicated(dfRes) results <- results[uniqueIdx] } results } setMethod("selectByRanges", "MultiDb", function(x,ranges,columns,overlaps,ignore.strand){ if(missing(overlaps)){ overlaps <- 'tx' } if(missing(columns)){ columns <- c('ENTREZID','SYMBOL') } if(missing(ignore.strand)){ ignore.strand <- FALSE } .selectByRanges(x,ranges,columns,overlaps,ignore.strand)}) ## ## Some Testing ## library(Homo.sapiens); ## ranges <- GRanges(seqnames=Rle(c('chr11'), c(2)),IRanges(start=c(107899550, 108025550), end=c(108291889, 108050000)), strand='+', seqinfo=seqinfo(Homo.sapiens)) ## selectByRanges(Homo.sapiens, ranges, 'SYMBOL') ## selectByRanges(Homo.sapiens, ranges, 'SYMBOL', 'exon') ## selectByRanges(Homo.sapiens, ranges, 'ENTREZID') ## ## What if they ask for something more compex? ## selectByRanges(Homo.sapiens, ranges, 'ALIAS') ## ## What if they ask for a couple things? 
## selectByRanges(Homo.sapiens, ranges, c('ENTREZID','ALIAS')) ## The following should all give the same basic answer (because ranges ## don't change) ## selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), 'tx') ## selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), 'exon') ## selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), 'cds') ## selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), '5utr') ## selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), '3utr') ## selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), 'intron') ## Current troubles: ## debug(OrganismDbi:::.selectByRanges) ## selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), '5utr') ## 3) The documentation and unit tests need a big upgrade... ## 4) I need an early version of Vinces complementary function still (selectRangesBy) ############################################################################# ## FOR LATER sticky I want to implement support for multiple values of ## 'overlaps' ## So I want to be able to do something kind of like this (to implement the geometry idea of multiple 'overlaps' arguments). BUT: it doesn't respect the contents of mcols... ## foo = selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), '5utr'); ## bar = selectByRanges(Homo.sapiens, ranges, c('SYMBOL','PATH'), 'tx'); ## unique(c(foo, bar)) ## The 'data.frame' shuffle might help here. IOW: ## results <- c(foo,bar) ## dfRes <- as(results,'data.frame') ## uniqueIdx = !duplicated(dfRes) ## results <- results[uniqueIdx] ## But then: we would still have to deal with the fact that we have to 'merge' overlapping ranges (and keep the metadata)... ## ALTERNATIVELY: I *could* implement this by just using the transcript centered strategy that I used above (for UTRs/introns), but applying it to 'everything', THEN merging all the tx centered metadata into a tx ID'd list and then overlapping as the last step. ## Older notes about this from the sticky: ## 'overlaps' CAN be a vector (which will result in multiple ranges getting summed together). - this suggestion is going to have to be it's own entirely separate sticky b/c the standard mechanisms for combining the results do *not* currently have any mechanism for respecting the geometry. EITHER THAT, or I am going to have to change the way that the whole function works (again), by handling everything more the way that I currently handle things for 5UTRs/3UTRs/introns. (IOW get all the ranges, always grouped by transcript, then combine to form one transcript oriented list and *then* annotate them, and then (at the end): overlap. ## New function (inspired by Vince) that will get ranges based on IDs. .selectRangesById <- function(x, keys, columns=character(), keytype='GENEID', feature=c('gene','tx','exon', 'cds') ){ ## Argument checks feature <- match.arg(feature) ## Then map the keys to GENEID (NOT ENTREZID) genes <- mapIds(x, keys, 'GENEID', keytype) ## Then filter the genes (because remember that not all keys will ## have a gene model in the TxDb) genes <- genes[genes %in% keys(x,'GENEID')] ## Stop if there are no keys remaining. 
if(length(genes) <1){ msg <- strwrap(paste0("None of the requested features has a gene ", "model in the TxDb Database.")) stop(msg) } ## then get the gene models rngs <- switch(feature, gene=genes(x,columns=columns), exon=exonsBy(x,columns=columns,by='gene',outerMcols=TRUE), cds=cdsBy(x,columns=columns,by='gene',outerMcols=TRUE), tx=transcriptsBy(x,columns=columns,by='gene',outerMcols=TRUE) ) ## Then subset those with the genes ids rngs <- rngs[genes] ## Only rename if we can safely do so keyNames <- mapIds(x, genes, column=keytype, 'GENEID') if(length(keyNames)==length(rngs)){ names(rngs) <- keyNames } rngs } setMethod("selectRangesById", "MultiDb", function(x,keys,columns,keytype,feature){ if(missing(columns)){ columns <- character() } if(missing(keytype)){ keytype <- 'GENEID' } if(missing(feature)){ feature <- 'tx' } .selectRangesById(x,keys,columns,keytype,feature)}) ## library(Homo.sapiens); ## debug(OrganismDbi:::.selectRangesById) ## selectRangesById(Homo.sapiens, c('1','100')) ## selectRangesById(Homo.sapiens, keys='1',columns=c('PATH','SYMBOL'), keytype='GENEID', 'exon' ) ## TODO: re-test these after fixing the ENTREZID <-> GENEID problem ## These all have the bad ENTREZID <-> GENEID error (have to fix elsewhere) ## selectRangesById(Homo.sapiens, 'A1BG', keytype='SYMBOL') ## selectRangesById(Homo.sapiens, keys='A1BG',columns=c('PATH','SYMBOL'), keytype='SYMBOL', 'tx' ) ## selectRangesById(Homo.sapiens, keys='A1BG',columns=c('PATH','SYMBOL'), keytype='SYMBOL', 'cds' ) ## This should error out: ## selectRangesById(Homo.sapiens, "11", columns=c('SYMBOL','TXNAME','TXID'), keytype='ENTREZID', feature='tx') ## But this should work: ## selectRangesById(Homo.sapiens, c("1","11"), columns=c('SYMBOL','TXNAME','TXID'), keytype='ENTREZID', feature='tx') ## selectRangesById(Homo.sapiens, keys(Homo.sapiens,'SYMBOL'), columns=c('SYMBOL','TXNAME','TXID'), keytype='SYMBOL', feature='tx') ## perf test ## system.time(res <- selectRangesById(Homo.sapiens, c("1","11"), keytype='ENTREZID', feature='tx')) ## VS ## system.time(res <- selectRangesById(Homo.sapiens, c("1","11"), columns=c('SYMBOL','TXNAME','TXID'), keytype='ENTREZID', feature='tx')) ## Major Problems remaining for supporting UTR/introns: ## 1) for UTR/introns I am getting back transcript centric ranges vs gene centric for everything else... Inconsistency like this is very bad - and it does not get corrected like it did for selectByRanges because there is no call to findOverlaps here... ## 2) For UTR/introns I am returning a list object, but the mcols I assign metadata into are the outer mcols (needs to be put on the inside too - at a minimum). This all causes the following to break because the filtering at the end won't have all the metadata that it needs anymore... ## selectRangesById(Homo.sapiens, keys=c('1','10'),columns=c('PATH','SYMBOL'), keytype='GENEID', '5utr' ) ## Because of this, I think I will put the addition of UTR/introns on hold untill I have other functions that can extract these things in a 'gene centric' manner - instead of in the current transcript centered only manner. ####################################################################### ## I need a way to get the inner mcols back out to the outer mcols (and quickly) ## Herve has a helper for extracting 'inner' mcols out of a GRanges ## list very quickly. 
makeOuterMcolFromInnerMcol <- function(x, colname) { if (!is(x, "List")) stop("'x' must be a List object") ## tmpOri <- unique(relist(mcols(unlist(x, use.names=FALSE))[[colname]], x)) tmp <- unique(relist(as.character(mcols(unlist(x, use.names=FALSE))[[colname]]),x)) if (any(elementNROWS(tmp) != 1L)) stop(colname, " inner metadata column cannot be made an outer metadata column") unlist(tmp) } ## Let's try with the exon_rank metadata col on the object returned by ## exonsBy() (should return an error): ## ex_by_tx <- exonsBy(txdb) ## mcols(ex_by_tx)[["exon_rank"]] <- makeOuterMcolFromInnerMcol(ex_by_tx, "exon_rank") ## but this should work with an inner metadata column that is really ## an attribute of the top-level list elements. ## unfortunately, this function seems to have some bugs. (Which I think I have mostly fixed) ## BUT: The function needs to extract all the viable mcols, and to format them as a DataFrame so that they can be put into the outer mcols for the results object. ## AND even if I get this function working perfectly, I need to check do one of TWO things for each column. If the column is from the inner column level (exon_rank) then I can't return that data in the result (since it won't map back out to the gene level). These inner columns are not lost. You might think that there is no sensible way to map them out to the result in the function above but they can just go into a integerList object... So things like 'EXONRANK' will have to be processed differently... In the case there the data is actually repeated from the outer column level then I need to use a variant of this function to map it back out. So: two different things need to happen based on whether the data is repeated or not... ## the helpers exonsBy and transcriptsBy etc. need to be 'fixed' so that (for viable mcols) they have their outer mcols populated. This will help since for the annotation recover, the outer mcols are the only ones I will want to use anyways. This is still true for things like EXONRANK since for EXONRANK I will have an integerList (for example). The bottom line is: everthing in that outer mcols needs to be annotated at the 'GROUP level' regardless of what is in the inner mcols... Once I have these base methods acting better it should be easy for my methor to do the right thing... ## there appears to be another separate bug that happens with exonsBy(x, by='tx') where the extra columns are not fully populated. This needs to be fixed but doesn't happen with by='gene'... ## stash some private variables to hold the information about which columns are viable and which ones are not (for this). This will help me to dispatch on columns that nee to be treated separately ## I may need a different accessor to list 'outer' columns, or I may need to add an argument to columns (geneLevel=TRUE) OrganismDbi/R/methods-transcripts.R0000644000175200017520000004541714136050033020341 0ustar00biocbuildbiocbuild## This is where I will put methods to overload things like ## transcripts() and exons()... ## new argument: columns here can be any legit value for columns. (not just ## tx_id and tx_name etc.) ## filter will just pass through to the internal transcripts call. ## columns arg is just for b/c support and will just pass through to ## the internal transcripts call ## ## For consistency, the helper columns just wraps around cols method... 
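## A hedged sketch of the columns/filter pass-through described above (the
## Homo.sapiens package and the filter value are assumptions used only for
## illustration; 'filter' is simply forwarded to the underlying TxDb call):
## library(Homo.sapiens)
## ## any value reported by columns(Homo.sapiens) is a legal 'columns' entry
## transcripts(Homo.sapiens, columns=c("TXNAME","SYMBOL"))
## ## restrict to one chromosome via the standard TxDb-style filter list
## transcripts(Homo.sapiens, columns="SYMBOL", filter=list(tx_chrom="chr1"))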
## setMethod("columns", "MultiDb", function(x){.cols(x)}) ## .getTxDb <- function(x){ ## ## trick: there will *always* be a TXID ## res <- .lookupDbFromKeytype(x, "TXID") ## if(!is.null(res)){ ## return(res) ## }else{ ## return(NA) ## } ## } .getTxDb <- function(x){ res <- x@txdbSlot if(!is.null(res)){ return(res) }else{ return(NA) } } ## expose method for gettting A TxDb (if there is one) setMethod("getTxDbIfAvailable", "MultiDb", function(x, ...){.getTxDb(x)}) ## And actually, just make a setter / getter for TxDbs on OrganismDb objects ## In this case, these methods are exclusive to OrganismDb objects. ## getter setMethod("TxDb", "OrganismDb", function(x, ...){.getTxDb(x)}) ## TxDb setter method ## .updateTxDb() helper makes graphInfo from mdb, modifies it to use new ## txdb info and then calls MultiDb() constructor... .updateTxDb <- function(x, value){ ## Here I need to work out what needs an update and update it... ## I need to find the TxDb in the object and replace it with ## the one in value if(class(value) != 'TxDb') stop('Replacement value must be a TxDb object.') ## 1st get the current TxDbs name txDbName <- OrganismDbi:::.lookupDbNameFromKeytype(x, 'TXID') ## we will use a generated name for internals when user does this. newTxDbName <- makePackageName(value) ## To modify the TxDb value rebuild the MultiDb ## 1) Extract/modify the keys/graphData gd <- x@keys gd[gd %in% txDbName] <- newTxDbName ## 2) Extract/modify the resources resources <- x@resources resources[names(resources) %in% txDbName] <- dbfile(value) names(resources)[names(resources) %in% txDbName] <- newTxDbName ## 3) rebuild (which should populate any slots etc.) graphInfo <- list(graphData=gd, resources=resources) x <- OrganismDbi:::OrganismDb(graphInfo=graphInfo) ## then return the new MultiDb object. x } setReplaceMethod("TxDb", "OrganismDb", function(x, value) .updateTxDb(x, value)) ## test for setter: ## library(OrganismDbi); example(makeOrganismDbFromTxDb); odb; ## saveDb(txdbMouse, file='myTxDb.sqlite') ## library(TxDb.Mmusculus.UCSC.mm9.knownGene); txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene; ## library(OrganismDbi); txdbMouse <- loadDb('myTxDb.sqlite'); odb <- makeOrganismDbFromTxDb(txdb=txdbMouse) ## debug(OrganismDbi:::.updateTxDb) ## works! ## debug(OrganismDbi:::.getTxDb) ## TxDb(odb) <- txdb; odb; odb@resources ## OK. It looks like setter is working, but for this particular ## example, my getter is still grabbing from global scope sometimes? ## (like in this example) - all because the names are being used too ## much by things like getters for the object. So in this example, I ## am at the mercy of the last thing that was loaded... :/ ## Actually, what I want to do is to make sure that any time I get DB ## resource, I first try to do loadDb on the matching thing from the ## resources slot AND THEN try to call get() on an existing name (but ## as a backup plan!). - and this seems to work, but there is a bad fail in one unit test that I have to check into now... ## And if I can't solve the problem in that way: ## Then this can also be made more robust if instead I put an actual ## TxDb and OrgDb object into custom slots (with custom getter methods ## for each that act only for OrganismDb objects). That way I can ## insulate my object from external interference from name clashes. ######################################################################## ## TODO: .compressMetadata() might be useful to move into IRanges, as ## a complement to expand() methods? 
- Discuss this with Val (who ## apparently may have similar issues in vcf... ## .compressMetadata() processes data.frame data into a DataFrame with ## compressed chars ## It does so by taking a special factor (f) and then applying it to ## ALL of the columns in a data.frame (meta) except for the one that ## was the basis for the special factor (avoidID) .compressMetadata <- function(f, meta, avoidID){ meta <- meta[, !colnames(meta) %in% avoidID, drop=FALSE] meta <- lapply(meta, function(column, f) unique(splitAsList(column, f)), f) DataFrame(meta, row.names=NULL) } ## This helper does book keeping that is relevant to my situation here. .combineMetadata <- function(rngs, meta, avoidID, joinID, columns){ ## compress the metadata by splitting according to f joinValue <- as.character(mcols(rngs)[[joinID]]) f <- factor(meta[[avoidID]], levels=unique(joinValue)) if (anyNA(f)) stop("not all annotations have matching ranges") if (avoidID %in% columns) # don't avoid the avoidID avoidID <- NULL res <- .compressMetadata(f, meta, avoidID) ridx <- match(joinValue, levels(f)) res <- c(mcols(rngs), res[ridx, , drop=FALSE]) res[!(colnames(res) %in% c("tx_id","exon_id","cds_id","gene_id"))] } ## How will we merge the results from select() and transcripts()? We ## will join on tx_id (for transcripts) .transcripts <- function(x, columns, filter){ ## 1st get the TxDb object. txdb <- .getTxDb(x) ## call transcripts method (on the TxDb) txs <- transcripts(txdb, columns="tx_id", filter=filter) ## call select on the rest and use tx_id as keys meta <- select(x, keys=as.character(mcols(txs)$tx_id), columns, "TXID") ## assemble it all together. mcols(txs) <- .combineMetadata(txs,meta,avoidID="TXID",joinID="tx_id", columns=columns) txs } setMethod("transcripts", "MultiDb", function(x, columns=c("TXID", "TXNAME"), filter=NULL){ .transcripts(x, columns, filter)} ) ## test usage: ## library(Homo.sapiens); h = Homo.sapiens; columns = c("TXNAME","SYMBOL") ## transcripts(h, columns) ## How will we merge the results from select() and transcripts()? We ## will join on tx_id (for transcripts) .exons <- function(x, columns, filter){ ## 1st get the TxDb object. txdb <- .getTxDb(x) ## call transcripts method (on the TxDb) exs <- exons(txdb, columns="exon_id", filter=filter) ## call select on the rest and use tx_id as keys meta <- select(x, keys=as.character(mcols(exs)$exon_id), columns, "EXONID") ## assemble it all together. mcols(exs) <- .combineMetadata(exs,meta,avoidID="EXONID",joinID="exon_id", columns=columns) exs } setMethod("exons", "MultiDb", function(x, columns="EXONID", filter=NULL){ .exons(x, columns, filter)}) ## test usage: ## library(Homo.sapiens); h = Homo.sapiens; columns = c("CHR","REFSEQ") ## exons(h, columns) ## How will we merge the results from select() and transcripts()? We ## will join on tx_id (for transcripts) .cds <- function(x, columns, filter){ ## 1st get the TxDb object. txdb <- .getTxDb(x) ## call transcripts method (on the TxDb) cds <- cds(txdb, columns="cds_id", filter=filter) ## call select on the rest and use tx_id as keys meta <- select(x, keys=as.character(mcols(cds)$cds_id), columns, "CDSID") ## assemble it all together. mcols(cds) <- .combineMetadata(cds,meta,avoidID="CDSID",joinID="cds_id", columns=columns) cds } setMethod("cds", "MultiDb", function(x, columns="CDSID", filter=NULL){ .cds(x, columns, filter)}) ## test usage: ## library(Homo.sapiens); h = Homo.sapiens; columns = c("GENENAME","SYMBOL") ## cds(h, columns) ## How will we merge the results from select() and transcripts()? 
We ## will join on tx_id (for transcripts) .genes <- function(x, columns, filter){ ## 1st get the TxDb object. txdb <- .getTxDb(x) ## call transcripts method (on the TxDb) genes <- genes(txdb, columns="gene_id", filter=filter) ## call select on the rest and use tx_id as keys meta <- select(x, keys=as.character(mcols(genes)$gene_id), columns, "GENEID") ## assemble it all together. mcols(genes) <- .combineMetadata(genes,meta,avoidID="GENEID", joinID="gene_id", columns=columns) genes } setMethod("genes", "MultiDb", function(x, columns="GENEID", filter=NULL){ .genes(x, columns, filter)}) ## test usage: ## library(Homo.sapiens); h = Homo.sapiens; columns = c("GENENAME","SYMBOL") ## genes(h, columns) ######################################################################## ######################################################################## ## The "By" methods ######################################################################## ######################################################################## ## "By" methods will just cram the same metadata into the INTERNAL ## metadata slot so that it appears with the show method. ## No attempt will be made to manage the insanity of knowing which ## metadata types belong in which spot... .byToKeytype <- function(by){ switch(by, 'gene'='GENEID', 'exon'='EXONID', 'cds'='CDSID', 'tx'='TXID') } .transcriptsBy <- function(x, by, columns, use.names, outerMcols){ ## 1st get the TxDb object. txdb <- .getTxDb(x) txby <- transcriptsBy(txdb, by=by, use.names=use.names) if(length(columns) >= 1){ ## get the tx_ids from the transcripts ## AND I need to one from the internal slot. gr <- txby@unlistData k <- as.character(mcols(gr)$tx_id) ## call select on the rest and use tx_id as keys meta <- select(x, keys=k, columns, "TXID") ## assemble it all together. mcols(gr) <- .combineMetadata(gr, meta, avoidID="TXID", joinID="tx_id", columns=columns) ## now cram it back in there. txby@unlistData <- gr ## AND ALSO put the metadata in for the 'outer' mcols... if(outerMcols==TRUE){ k2 <- names(txby) keytype <- .byToKeytype(by) meta2 <- select(x, keys=k2, columns, keytype) ## Step here needed to make meta2 from data.frame into DataFrame f <- factor(meta2[[keytype]], levels=unique(as.character(k2))) metaC <- .compressMetadata(f, meta2, avoidID=keytype) mcols(txby) <- metaC } } txby } setMethod("transcriptsBy", "MultiDb", function(x, by="gene", columns=character(), use.names=FALSE, outerMcols=FALSE){ if(missing(by) || !any(by %in% c("gene","exon","cds")) || length(by) !=1){ stop("You must provide a valid argument for by")} .transcriptsBy(x, by, columns, use.names=use.names, outerMcols=outerMcols)}) .exonsBy <- function(x, by, columns, use.names, outerMcols){ ## 1st get the TxDb object. txdb <- .getTxDb(x) exby <- exonsBy(txdb, by=by, use.names=use.names) if(length(columns) >= 1){ ## get the tx_ids from the transcripts ## AND I need to one from the internal slot. gr <- exby@unlistData k <- as.character(mcols(gr)$exon_id) ## call select on the rest and use tx_id as keys meta <- select(x, keys=k, columns, "EXONID") ## assemble it all together. mcols(gr) <- .combineMetadata(gr, meta, avoidID="EXONID", joinID="exon_id", columns=columns) ## now cram it back in there. exby@unlistData <- gr ## AND ALSO put the metadata in for the 'outer' mcols... 
        if(outerMcols==TRUE){
            k2 <- names(exby)
            keytype <- .byToKeytype(by)
            meta2 <- select(x, keys=k2, columns, keytype)
            ## Step here needed to make meta2 from data.frame into DataFrame
            f <- factor(meta2[[keytype]], levels=unique(as.character(k2)))
            metaC <- .compressMetadata(f, meta2, avoidID=keytype)
            mcols(exby) <- metaC
        }
    }
    exby
}

setMethod("exonsBy", "MultiDb",
          function(x, by="tx", columns=character(), use.names=FALSE,
                   outerMcols=FALSE){
              if(missing(by) || !any(by %in% c("tx", "gene")) || length(by) !=1){
                  stop("You must provide a valid argument for by")}
              .exonsBy(x, by, columns, use.names=use.names, outerMcols=outerMcols)})

.cdsBy <- function(x, by, columns, use.names, outerMcols){
    ## 1st get the TxDb object.
    txdb <- .getTxDb(x)
    cdsby <- cdsBy(txdb, by=by, use.names=use.names)
    if(length(columns) >= 1){
        ## get the cds_ids from the ranges,
        ## AND I need the ones from the internal slot.
        gr <- cdsby@unlistData
        k <- as.character(mcols(gr)$cds_id)
        ## call select on the rest and use cds_id as keys
        meta <- select(x, keys=k, columns, "CDSID")
        ## assemble it all together.
        mcols(gr) <- .combineMetadata(gr, meta, avoidID="CDSID",
                                      joinID="cds_id", columns=columns)
        ## now cram it back in there.
        cdsby@unlistData <- gr
        ## AND ALSO put the metadata in for the 'outer' mcols...
        if(outerMcols==TRUE){
            k2 <- names(cdsby)
            keytype <- .byToKeytype(by)
            meta2 <- select(x, keys=k2, columns, keytype)
            ## Step here needed to make meta2 from data.frame into DataFrame
            f <- factor(meta2[[keytype]], levels=unique(as.character(k2)))
            metaC <- .compressMetadata(f, meta2, avoidID=keytype)
            mcols(cdsby) <- metaC
        }
    }
    cdsby
}

setMethod("cdsBy", "MultiDb",
          function(x, by="tx", columns=character(), use.names=FALSE,
                   outerMcols=FALSE){
              if(missing(by) || !any(by %in% c("tx", "gene")) || length(by) !=1){
                  stop("You must provide a valid argument for by")}
              .cdsBy(x, by, columns, use.names=use.names, outerMcols=outerMcols)})

## TODO: (known issues)
## 1) columns don't come back in the same order that they went in
## 2) some values (tx_id and tx_name come to mind) are not relabeled
## in a pretty way and may not have been requested (to solve this we
## have to address issue #3)
## 3) there are now two redundant column-selection arguments for the
## transcripts() family of methods (one on the base TxDb method and one
## here). Proposed fix: rename the base method's argument to 'columns'
## (maybe this is also an opportunity to rename it everywhere) so that the
## naming is consistent, and then here just keep one argument...
## 4) exonsBy and cdsBy may have some extra issues that I am missing...
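## A hedged usage sketch of the outerMcols behaviour described above (the
## Homo.sapiens package and the SYMBOL column are assumptions used only for
## illustration):
## library(Homo.sapiens)
## ## inner (per-range) mcols only:
## exby <- exonsBy(Homo.sapiens, by="gene", columns="SYMBOL")
## mcols(unlist(exby))$SYMBOL
## ## with outerMcols=TRUE the group-level (outer) mcols are populated as
## ## well, one compressed entry per gene-level group:
## exby2 <- exonsBy(Homo.sapiens, by="gene", columns="SYMBOL", outerMcols=TRUE)
## mcols(exby2)$SYMBOL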
############################################################################### ## Wrapper methods for TxDb methods (that are the same) setMethod(promoters, 'MultiDb', function(x, upstream=2000, downstream=200, ...){ promoters(getTxDbIfAvailable(x), upstream, downstream, ...)}) setMethod(disjointExons, 'MultiDb', function(x, aggregateGenes=FALSE, includeTranscripts=TRUE, ...){ disjointExons(getTxDbIfAvailable(x), aggregateGenes, includeTranscripts, ...)}) setMethod(microRNAs, 'MultiDb', function(x){microRNAs(getTxDbIfAvailable(x))}) setMethod(tRNAs, 'MultiDb', function(x){tRNAs(getTxDbIfAvailable(x))}) setMethod(intronsByTranscript, 'MultiDb', function(x, use.names=FALSE){ intronsByTranscript(getTxDbIfAvailable(x), use.names=use.names)}) setMethod(fiveUTRsByTranscript, 'MultiDb', function(x, use.names=FALSE){ fiveUTRsByTranscript(getTxDbIfAvailable(x), use.names=use.names)}) setMethod(threeUTRsByTranscript, 'MultiDb', function(x, use.names=FALSE){ threeUTRsByTranscript(getTxDbIfAvailable(x), use.names=use.names)}) setMethod(extractUpstreamSeqs, 'MultiDb', function(x, genes, width=1000, exclude.seqlevels=NULL){ extractUpstreamSeqs(x, getTxDbIfAvailable(genes), width=width, exclude.seqlevels=exclude.seqlevels)}) ## problem: no way for this to dispatch correctly... ## So we are basically trampling the original method definition ## here. (which is just NOT elegant) ## But: we plan to move OrganismDbi down into TxDbs later on so this ## is temporary. ## setMethod(extractTranscriptSeqs, 'BSgenome', ## function(x, transcripts, strand="+"){ ## if(class(transcripts)=='MultiDb'){ ## transcripts<-getTxDbIfAvailable(transcripts) ## } ## extractTranscriptSeqs(x, transcripts, strand=strand)}) ## This works now without the need for an overload... ## library(Homo.sapiens);library(BSgenome.Hsapiens.UCSC.hg19);genome <- BSgenome.Hsapiens.UCSC.hg19;debug(GenomicFeatures:::.normarg_transcripts);tx_seqs <- extractTranscriptSeqs(genome, Homo.sapiens) setMethod(isActiveSeq, 'MultiDb', function(x){isActiveSeq(getTxDbIfAvailable(x))}) .updateTxDbSeqMultiDb <-function(x, value){ ## This will change the val in 'x' as well... txdb <- getTxDbIfAvailable(x) if (is(txdb, "TxDb")) { ## will be NA if there isn't one. isActiveSeq(txdb) <- value } else { stop('This object does not contain a TxDb object') } x } setReplaceMethod('isActiveSeq', 'MultiDb', function(x, value){.updateTxDbSeqMultiDb(x, value)}) setMethod(asBED, 'MultiDb', function(x){asBED(getTxDbIfAvailable(x))}) setMethod(asGFF, 'MultiDb', function(x){asGFF(getTxDbIfAvailable(x))}) ## These ones dispatch on compound types (not just TxDbs): setMethod(distance, c('GenomicRanges','MultiDb'), function(x, y, ignore.strand=FALSE, ..., id, type=c("gene", "tx", "exon", "cds")){ distance(x, getTxDbIfAvailable(y), ignore.strand=ignore.strand, ..., id=id, type=type)}) setMethod(mapToTranscripts, c('ANY', 'MultiDb'), function(x, transcripts, ignore.strand=TRUE, extractor.fun = GenomicFeatures::transcripts, ...){ mapToTranscripts(x, transcripts=getTxDbIfAvailable(transcripts), ignore.strand=ignore.strand, extractor.fun = extractor.fun, ... )}) OrganismDbi/R/seqinfo.R0000644000175200017520000000607114136050033015761 0ustar00biocbuildbiocbuild## TODO: ## 1) outline and export (etc.) the methods. ## 2) hook them up to the proper subcomponents (x$txdb) ## 3) need a helper to make the extraction of the txdb SAFE and warn if this is not possible. 
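## A hedged usage sketch of the seqinfo()-driven accessors defined below
## (Homo.sapiens is an assumption used for illustration; the seqlevels() and
## genome() calls work through the generic seqinfo()-based ANY methods):
## library(Homo.sapiens)
## seqinfo(Homo.sapiens)            ## delegates to the embedded TxDb
## head(seqlevels(Homo.sapiens))
## unique(genome(Homo.sapiens))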
.safelyGetTxDb <- function(x){ if("TXID" %in% columns(x)){ return(.getTxDb(x)) }else{ stop("The MultiDb object does not have an embedded TxDb.") } } ## with the exception of seqnames (which doesn't make sense, this ## whole family will work for MultiDb objects (even though I have ## not defined them all explicitely (because they all call seqinfo() ## and have ANY methods) ## So I only need to alias seqinfo and seqnameStyle in the manual page ## (and then put in a see also reference) ############################################################################# ## To get all these functions, I ONLY need to make this setter and getter work setMethod("seqinfo","MultiDb", function(x){ txdb <- .safelyGetTxDb(x) seqinfo(txdb) }) ## This can work once I have a local object that I can modify? ## setReplaceMethod("seqinfo", "MultiDb", ## function(x, new2old=NULL, force=FALSE, value){ ## txdb <- .safelyGetTxDb(x) ## seqinfo(txdb, new2old=NULL, force=FALSE) <- value ## }) ## ############################################################################# ## setMethod("seqlevels","MultiDb", ## function(x){ seqlevels(.safelyGetTxDb(x)) ## }) ## setReplaceMethod("seqlevels", "MultiDb", ## function(x, force=FALSE, value){ ## seqlevels(.safelyGetTxDb(x), force=FALSE) <- value ## }) ## ############################################################################# ## setMethod("seqlengths","MultiDb", ## function(x){ seqlengths(.safelyGetTxDb(x)) ## }) ## setReplaceMethod("seqlengths", "MultiDb", ## function(x, value){ ## seqlengths(.safelyGetTxDb(x)) <- value ## }) ## ############################################################################# ## setMethod("isCircular","MultiDb", ## function(x){ isCircular(.safelyGetTxDb(x)) ## }) ## setReplaceMethod("isCircular", "MultiDb", ## function(x, value){ ## isCircular(.safelyGetTxDb(x)) <- value ## }) ## ############################################################################# ## setMethod("genome","MultiDb", ## function(x){ genome(.safelyGetTxDb(x)) ## }) ## setReplaceMethod("genome", "MultiDb", ## function(x, value){ ## genome(.safelyGetTxDb(x)) <- value ## }) ############################################################################# ## setMethod("seqnameStyle","MultiDb", ## function(x){ ## txdb <- .safelyGetTxDb(x) ## seqnameStyle(txdb) ## }) ## setReplaceMethod("seqnameStyle", "MultiDb", ## function(x, value){ ## txdb <- .safelyGetTxDb(x) ## seqnameStyle(txdb) <- value ## x@ ## }) OrganismDbi/R/test_OrganismDbi_package.R0000644000175200017520000000007614136050033021224 0ustar00biocbuildbiocbuild.test <- function() BiocGenerics:::testPackage("OrganismDbi") OrganismDbi/build/0000755000175200017520000000000014136076334015100 5ustar00biocbuildbiocbuildOrganismDbi/build/vignette.rds0000644000175200017520000000036314136076334017441 0ustar00biocbuildbiocbuild}QK 0MEQrܸ.,Ņ۱M%h;ONL&yy9 !(%(N@2"ee/xV,X!g)VR*F("HpEYC)/4٨\%K9_XO"{ i}zw=}X%-orv^ÍuӿUnxlJ[8k'5'ԁ^lOrganismDbi/inst/0000755000175200017520000000000014136076334014756 5ustar00biocbuildbiocbuildOrganismDbi/inst/OrgPkg-template/0000755000175200017520000000000014136050033017744 5ustar00biocbuildbiocbuildOrganismDbi/inst/OrgPkg-template/DESCRIPTION0000644000175200017520000000061614136050033021455 0ustar00biocbuildbiocbuildPackage: @PKGNAME@ Title: @PKGTITLE@ Description: @PKGDESCRIPTION@ Version: @PKGVERSION@ Author: @AUTHOR@ Maintainer: @MAINTAINER@ Depends: R (>= 1.6), AnnotationDbi (>= 1.17.11), methods, OrganismDbi, GenomicFeatures, @DEPENDENCIES@ Imports: GenomicFeatures, AnnotationDbi License: @LIC@ organism: 
@ORGANISM@ species: @ORGANISM@ biocViews: AnnotationData, Genetics, OrganismDb, @ORGANISMBIOCVIEW@ OrganismDbi/inst/OrgPkg-template/NAMESPACE0000644000175200017520000000011414136050033021157 0ustar00biocbuildbiocbuild# A NAMESPACE is strictly required by R 2.14 or higher. import(OrganismDbi) OrganismDbi/inst/OrgPkg-template/R/0000755000175200017520000000000014136050033020145 5ustar00biocbuildbiocbuildOrganismDbi/inst/OrgPkg-template/R/zzz.R0000644000175200017520000000045614136050033021132 0ustar00biocbuildbiocbuild## .onLoad gets the data.frame from the /data directory .onLoad <- function(libname, pkgname) { load(system.file("data","graphInfo.rda",package=pkgname, lib.loc=libname)) OrganismDbi:::.loadOrganismDbiPkg(pkgname=pkgname, graphInfo=graphInfo) } OrganismDbi/inst/OrgPkg-template/man/0000755000175200017520000000000014136050033020517 5ustar00biocbuildbiocbuildOrganismDbi/inst/OrgPkg-template/man/@PKGNAME@.Rd0000644000175200017520000000154714136050033022277 0ustar00biocbuildbiocbuild\name{@PKGNAME@} \docType{data} \alias{@PKGNAME@} \title{Annotation package that collates several annotation resources.} \description{This data object was automatically created by @AUTHOR@. It represents a collection of annotation packages that can be used as a single object named @PKGNAME@. This object can be used with the standard four accessor method for all AnnotationDbi objects. Namely: \code{cols}, \code{keytype}, \code{keys} and \code{select}. Users are encouraged to read the vignette from the \code{OrganismDbi} package for more details.} \usage{library(@PKGNAME@)} \keyword{datasets} \examples{ @PKGNAME@ cls <- columns(@PKGNAME@) cls cls <- cls[c(1,19,45)] kts <- keytypes(@PKGNAME@) kt <- kts[2] kts ks <- head(keys(@PKGNAME@, keytype=kts[2])) ks res <- select(@PKGNAME@, keys=ks, columns=cls, keytype=kt) head(res) } OrganismDbi/inst/doc/0000755000175200017520000000000014136076334015523 5ustar00biocbuildbiocbuildOrganismDbi/inst/doc/OrganismDbi.R0000644000175200017520000000402514136076333020044 0ustar00biocbuildbiocbuild### R code from vignette source 'OrganismDbi.Rnw' ################################################### ### code chunk number 1: columns ################################################### library(Homo.sapiens) columns(Homo.sapiens) ################################################### ### code chunk number 2: keys ################################################### keytypes(Homo.sapiens) ################################################### ### code chunk number 3: keys ################################################### head(keys(Homo.sapiens, keytype="ENTREZID")) ################################################### ### code chunk number 4: select ################################################### k <- head(keys(Homo.sapiens, keytype="ENTREZID"),n=3) select(Homo.sapiens, keys=k, columns=c("TXNAME","SYMBOL"), keytype="ENTREZID") ################################################### ### code chunk number 5: transcripts ################################################### transcripts(Homo.sapiens, columns=c("TXNAME","SYMBOL")) ################################################### ### code chunk number 6: transcriptsBy ################################################### transcriptsBy(Homo.sapiens, by="gene", columns=c("TXNAME","SYMBOL")) ################################################### ### code chunk number 7: setupColData ################################################### gd <- list(join1 = c(GO.db="GOID", org.Hs.eg.db="GO"), join2 = c(org.Hs.eg.db="ENTREZID", 
TxDb.Hsapiens.UCSC.hg19.knownGene="GENEID")) ################################################### ### code chunk number 8: makeOrganismPackage (eval = FALSE) ################################################### ## destination <- tempfile() ## dir.create(destination) ## makeOrganismPackage(pkgname = "Homo.sapiens", ## graphData = gd, ## organism = "Homo sapiens", ## version = "1.0.0", ## maintainer = "Package Maintainer", ## author = "Some Body", ## destDir = destination, ## license = "Artistic-2.0") OrganismDbi/inst/doc/OrganismDbi.Rnw0000644000175200017520000002076614136050033020410 0ustar00biocbuildbiocbuild%\VignetteIndexEntry{OrganismDbi: A meta framework for Annotation Packages} %\VignetteDepends{Homo.sapiens} \documentclass[11pt]{article} \usepackage{Sweave} \usepackage[usenames,dvipsnames]{color} \usepackage{graphics} \usepackage{latexsym, amsmath, amssymb} \usepackage{authblk} \usepackage[colorlinks=true, linkcolor=Blue, urlcolor=Blue, citecolor=Blue]{hyperref} %% Simple macros \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsl{#1}} \newcommand\R{\textsl{R}} \newcommand\Bioconductor{\textsl{Bioconductor}} \newcommand\Rpackage[1]{{\textsl{#1}\index{#1 (package)}}} \newcommand\Biocpkg[1]{% {\href{http://bioconductor.org/packages/devel/bioc/html/#1.html}% {\textsl{#1}}}% \index{#1 (package)}} \newcommand\Rpkg[1]{% {\href{http://cran.fhcrc.org/web/devel/#1/index.html}% {\textsl{#1}}}% \index{#1 (package)}} \newcommand\Biocdatapkg[1]{% {\href{http://bioconductor.org/packages/devel/data/experiment/html/#1.html}% {\textsl{#1}}}% \index{#1 (package)}} \newcommand\Robject[1]{{\small\texttt{#1}}} \newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}} \newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}} \newcommand\Rmethod[1]{{\texttt{#1}}} \newcommand\Rfunarg[1]{{\small\texttt{#1}}} \newcommand\Rcode[1]{{\small\texttt{#1}}} %% Question, Exercise, Solution \usepackage{theorem} \theoremstyle{break} \newtheorem{Ext}{Exercise} \newtheorem{Question}{Question} \newenvironment{Exercise}{ \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}% }{\end{Ext}} \newenvironment{Solution}{% \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}% }{\bigskip} \title{OrganismDbi: A meta framework for Annotation Packages} \author{Marc Carlson} \SweaveOpts{keep.source=TRUE} \begin{document} \maketitle OrganismDbi is a software package that helps tie together different annotation resources. It is expected that users may have previously made seen packages like \Rpackage{org.Hs.eg.db} and \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene}. Packages like these two are very different and contain very different kinds of information, but are still about the same organism: Homo sapiens. The \Rpackage{OrganismDbi} package allows us to combine resources like these together into a single package resource, which can represent ALL of these resources at the same time. An example of this is the \Rpackage{homo.sapiens} package, which combines access to the two resources above along with others. This is made possible because the packages that are represented by \Rpackage{homo.sapiens} are related to each other via foreign keys. \begin{figure}[ht] \centering \includegraphics[width=.6\textwidth]{databaseTypes.pdf} \caption{Relationships between Annotation packages} \label{fig:dbtypes} \end{figure} \section{Getting started with OrganismDbi} Usage of a package like this has been deliberately kept very simple. 
The methods supported are the same ones that work for all the packages based on \Rclass{AnnotationDb} objects. The methods that can be applied to these new packages are \Rmethod{columns}, \Rmethod{keys}, \Rmethod{keytypes} and \Rmethod{select}. So to learn which kinds of data can be retrieved from a package like this we would simply load the package and then call the \Rmethod{columns} method. <>= library(Homo.sapiens) columns(Homo.sapiens) @ To learn which of those kinds of data can be used as keys to extract data, we use the \Rmethod{keytypes} method. <>= keytypes(Homo.sapiens) @ To extract specific keys, we need to use the \Rmethod{keys} method, and also provide it a legitimate keytype: <>= head(keys(Homo.sapiens, keytype="ENTREZID")) @ And to extract data, we can use the \Rmethod{select} method. The select method depends on the values from the previous three methods to specify what it will extract. Here is an example that will extract, UCSC transcript names, and gene symbols using Entrez Gene IDs as keys. <>= k <- head(keys(Homo.sapiens, keytype="ENTREZID"),n=3) select(Homo.sapiens, keys=k, columns=c("TXNAME","SYMBOL"), keytype="ENTREZID") @ In Addition to \Rmethod{select}, some of the more popular range based methods have also been updated to work with an \Rclass{AnnotationDb} object. So for example you could extract transcript information like this: <>= transcripts(Homo.sapiens, columns=c("TXNAME","SYMBOL")) @ And the \Rclass{GRanges} object that would be returned would have the information that you specified in the columns argument. You could also have used the \Rmethod{exons} or \Rmethod{cds} methods in this way. The \Rmethod{transcriptsBy},\Rmethod{exonsBy} and \Rmethod{cdsBy} methods are also supported. For example: <>= transcriptsBy(Homo.sapiens, by="gene", columns=c("TXNAME","SYMBOL")) @ \section{Making your own OrganismDbi packages} So in the preceding section you can see that using an \Rpackage{OrganismDbi} package behaves very similarly to how you might use a \Robject{TxDb} or an \Robject{OrgDb} package. The same methods are defined, and the behave similarly except that they now have access to much more data than before. But before you make your own OrganismDbi package you need to understand that there are few logical limitations for what can be included in this kind of package. \begin{itemize} \item The 1st limitation is that all the annotation resources in question must have implemented the four methods described in the preceding section (\Rmethod{columns}, \Rmethod{keys}, \Rmethod{keytypes} and \Rmethod{select}). \item The 2nd limitation is that you cannot have more than one example of each field that can be retrieved from each type of package that is included. So basically, all values returned by \Rmethod{columns} must be unique across ALL of the supporting packages. \item The 3rd limitation is that you cannot have more than one example of each object type represented. So you cannot have two org packages since that would introduce two \Robject{OrgDb} objects. \item And the 4th limitation is that you cannot have cycles in the graph. What this means is that there will be a graph that represents the relationships between the different object types in your package, and this graph must not present more than one pathway between any two nodes/objects. This limitation means that you can choose one foreign key relationship to connect any two packages in your graph. \end{itemize} With these limitations in mind, lets set up an example. 
Lets show how we could make Homo.sapiens, such that it allowed access to \Rpackage{org.Hs.eg.db}, \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene} and \Rpackage{GO.db}. The 1st thing that we need to do is set up a list that expresses the way that these different packages relate to each other. To do this, we make a \Robject{list} that contains short two element long character vectors. Each character vector represents one relationship between a pair of packages. The names of the vectors are the package names and the values are the foreign keys. Please note that the foreign key values in these vectors are the same strings that are returned by the \Rmethod{columns} method for the individual packages. Here is an example that shows how \Rpackage{GO.db}, \Rpackage{org.Hs.eg.db} and \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene} all relate to each other. <>= gd <- list(join1 = c(GO.db="GOID", org.Hs.eg.db="GO"), join2 = c(org.Hs.eg.db="ENTREZID", TxDb.Hsapiens.UCSC.hg19.knownGene="GENEID")) @ So this \Robject{data.frame} indicates both which packages are connected to each other, and also what these connections are using for foreign keys. Once this is finished, we just have to call the \Rfunction{makeOrganismPackage} function to finish the task. <>= destination <- tempfile() dir.create(destination) makeOrganismPackage(pkgname = "Homo.sapiens", graphData = gd, organism = "Homo sapiens", version = "1.0.0", maintainer = "Package Maintainer", author = "Some Body", destDir = destination, license = "Artistic-2.0") @ \Rfunction{makeOrganismPackage} will then generate a lightweight package that you can install. This package will not contain all the data that it refers to, but will instead depend on the packages that were referred to in the \Robject{data.frame}. Because the end result will be a package that treats all the data mapped together as a single source, the user is encouraged to take extra care to ensure that the different packages used are from the same build etc. \end{document} OrganismDbi/vignettes/databaseTypes.pdf0000644000175200017520000000454514136050033021271 0ustar00biocbuildbiocbuild%PDF-1.3 % 1 0 obj <> endobj 2 0 obj <> endobj 3 0 obj <> endobj 4 0 obj <> endobj 5 0 obj <> stream xڵn7 st֥(6^)qf%z]MYsVߐ%QMfxcϏ\۾M,\#O^ďv+(VknG`j:p#`lv?%iv)el ڣeCghX@) gFhi0arhLAj=#YEO`fiode&"u*Z0-MȊ, wtD!Ά.)өlۮ?S_u1cU>Gu}γ{ %ʽrkvz,r&L~{wp~9Tt/{gĦ#mu`+7ucb.pD::ޓNk78*K'?^= sivOpApV߾E##hx=W>s +]9ݑ.`LȧKq»VqEB V'uP~ޛ}8k\8'8 *[}%}t'\pnvq,*7 #ݑ"4ZGKr7q_'ҕ@t&H~VSlp<5v]ZpX`^@j58!5g$*) w,y=AnI1J'{K`tf+{wFb@d9K >) F]]RsT鉈xyυ苍I+w,pϭdSŘwI'nnTC1%x h>*N xSVk-Fntb*7p]6ک; =8ZSj֛$w]hc3O/'$b> 'mZY<.Nٛ|X:?,.@}w* +7= PV7I)2Ca&kdEJI$)NVTbi%GO%q;Ck m[l@݉\OJ 9DCҜ5K9/h*xqrUpތUn W~]`zo%oL6BVӧN?2"anT idaJRR$JNH&_"s1(iwRw&I aMlbcHC;c endstream endobj 7 0 obj <> endobj 8 0 obj <>>> endobj xref 0 9 0000000000 65535 f 0000000015 00000 n 0000000066 00000 n 0000000111 00000 n 0000000283 00000 n 0000000380 00000 n 0000000000 65535 f 0000002008 00000 n 0000002096 00000 n trailer <> startxref 2153 %%EOF