tximport/DESCRIPTION
Package: tximport
Version: 1.14.0
Title: Import and summarize transcript-level estimates for transcript-
        and gene-level analysis
Description: Imports transcript-level abundance, estimated counts and
        transcript lengths, and summarizes into matrices for use with
        downstream gene-level analysis packages. Average transcript
        length, weighted by sample-specific transcript abundance
        estimates, is provided as a matrix which can be used as an
        offset for differential expression of gene-level counts.
Author: Michael Love [cre,aut], Charlotte Soneson [aut], Mark Robinson
        [aut], Rob Patro [ctb], Andrew Parker Morgan [ctb], Ryan C.
        Thompson [ctb], Matt Shirley [ctb], Avi Srivastava [ctb]
Maintainer: Michael Love
License: GPL (>=2)
VignetteBuilder: knitr
Imports: utils, stats, methods
Suggests: knitr, rmarkdown, testthat, tximportData,
        TxDb.Hsapiens.UCSC.hg19.knownGene, readr (>= 0.2.2), limma,
        edgeR, csaw, DESeq2 (>= 1.11.6), rhdf5, jsonlite, matrixStats,
        Matrix, fishpond
URL: https://github.com/mikelove/tximport
biocViews: DataImport, Preprocessing, RNASeq, Transcriptomics,
        Transcription, GeneExpression, ImmunoOncology
RoxygenNote: 6.1.1
NeedsCompilation: no
Encoding: UTF-8
git_url: https://git.bioconductor.org/packages/tximport
git_branch: RELEASE_3_10
git_last_commit: 0fc7537
git_last_commit_date: 2019-10-29
Date/Publication: 2019-10-29
Packaged: 2019-10-30 01:48:37 UTC; biocbuild

tximport/NAMESPACE
# Generated by roxygen2: do not edit by hand

export(makeCountsFromAbundance)
export(summarizeToGene)
export(tximport)
exportMethods(summarizeToGene)
importFrom(methods,is)
importFrom(stats,median)
importFrom(utils,capture.output)
importFrom(utils,compareVersion)
importFrom(utils,head)
importFrom(utils,packageVersion)
importFrom(utils,read.delim)

tximport/NEWS
CHANGES IN VERSION 1.14.0
--------------------------

    o Alevin counts and inferential variance can now be imported
      ~40x faster for a large number of cells, leveraging C++ code
      from the fishpond package (>= 1.1.18).

    o Alevin inferential replicates can be imported (also sparse).
      To not import the inferential replicates, set dropInfReps=TRUE.

CHANGES IN VERSION 1.11.1
-------------------------

    o Added arguments 'sparse' and 'sparseThreshold' to allow for
      sparse count import. For the initial implementation: only works
      for txOut=TRUE; countsFromAbundance either "no" or "scaledTPM";
      doesn't work with inferential replicates, and only imports
      counts (and abundances if countsFromAbundance="scaledTPM").

CHANGES IN VERSION 1.9.11
-------------------------

    o Exporting simple internal function makeCountsFromAbundance().

CHANGES IN VERSION 1.9.10
-------------------------

    o Added 'infRepStat' argument which offers re-computation of counts
      and abundances using a function applied to the inferential
      replicates, e.g. matrixStats::rowMedians for using the median
      of posterior samples as the point estimate provided in
      "counts" and "abundance". If 'countsFromAbundance' is specified,
      this will compute counts a second time from the re-computed
      abundances.

CHANGES IN VERSION 1.9.9
------------------------

    o Adding support for gene-level summarization of inferential
      replicates. This takes place by performing row summarization
      on the inferential replicates (counts) in the same manner
      as the original counts (and optionally computing the variance).

CHANGES IN VERSION 1.9.6
------------------------

    o Added new countsFromAbundance method: "dtuScaledTPM".
      This is designed for DTU analysis and to be used with
      txOut=TRUE. It provides counts that are scaled,
      within a gene, by the median transcript length among isoforms,
      then later by the sample's sequencing depth, as in the other
      two methods. The transcript lengths are calculated by first
      taking the average across samples.
      With this new method, all the abundances within a gene
      across all samples are scaled up by the same length,
      preserving isoform proportions calculated from the counts.

CHANGES IN VERSION 1.9.4
------------------------

    o Made a change to summarizeToGene() that will now provide
      different output with a warning to alert the user.
      The case is: if tximport() is run with
      countsFromAbundance="scaledTPM" or "lengthScaledTPM" and
      txOut=TRUE, followed by summarizeToGene() with
      countsFromAbundance="no". This is a problematic series of
      calls, and previously it was ignoring the fact that the
      incoming counts are not original counts. Now, summarizeToGene()
      will throw a warning and override countsFromAbundance="no"
      to instead set it to the value that was used when tximport
      was originally run, either "scaledTPM" or "lengthScaledTPM".

CHANGES IN VERSION 1.9.1
------------------------

    o Fixed edgeR example code in vignette to use scaleOffset
      after recommendation from Aaron Lun (2018-05-25).

CHANGES IN VERSION 1.8.0
------------------------

    o Added support for StringTie output.

CHANGES IN VERSION 1.3.8
------------------------

    o Support for inferential replicates written by Rob Patro!
      Works for Salmon, Sailfish and kallisto. See details in
      ?tximport.

CHANGES IN VERSION 1.3.6
------------------------

    o Now, 'countsFromAbundance' is not ignored when txOut=TRUE.

CHANGES IN VERSION 1.3.4
------------------------

    o Support for kallisto HDF5 files thanks to
      Andrew Parker Morgan and Ryan C Thompson

    o Removing 'reader' argument, leaving only 'importer'
      argument. In addition, read_tsv will be used by default
      if readr package is installed.

    o Messages from the importing function are captured
      to avoid screen clutter.

CHANGES IN VERSION 0.99.0
-------------------------

    o Preparing package for Bioconductor submission.

CHANGES IN VERSION 0.0.19
-------------------------

    o Added `summarizeToGene` which breaks out the gene-level
      summary step, so it can be run by users on lists of
      transcript-level matrices produced by `tximport` with
      `txOut=TRUE`.

CHANGES IN VERSION 0.0.18
=========================

    o Changed argument `gene2tx` to `tx2gene`. This order is
      more intuitive: linking transcripts to genes, and matches
      the `geneMap` argument of Salmon and Sailfish.

tximport/R/alevin.R
readAlevinPreV014 <- function(files) {
  message("using importer for pre-v0.14.0 Alevin files")
  dir <- sub("/alevin$","",dirname(files))
  barcode.file <- file.path(dir, "alevin/quants_mat_rows.txt")
  gene.file <- file.path(dir, "alevin/quants_mat_cols.txt")
  matrix.file <- file.path(dir, "alevin/quants_mat.gz")
  var.file <- file.path(dir, "alevin/quants_var_mat.gz")
  for (f in c(barcode.file, gene.file, matrix.file)) {
    if (!file.exists(f)) {
      stop("expecting 'files' to point to 'quants_mat.gz' file in a directory 'alevin'
           also containing 'quants_mat_rows.txt' and 'quants_mat_cols.txt'.
           please re-run alevin preserving output structure")
    }
  }
  cell.names <- readLines(barcode.file)
  gene.names <- readLines(gene.file)
  num.cells <- length(cell.names)
  num.genes <- length(gene.names)
  mat <- matrix(nrow=num.genes, ncol=num.cells, dimnames=list(gene.names, cell.names))
  message("reading in alevin gene-level counts across cells")
  con <- gzcon(file(matrix.file, "rb"))
  for (j in seq_len(num.cells)) {
    mat[,j] <- readBin(con, double(), endian = "little", n=num.genes)
  }
  close(con)
  # if inferential replicate variance exists:
  if (file.exists(var.file)) {
    counts.mat <- mat
    var.mat <- mat
    message("reading in alevin inferential variance")
    con <- gzcon(file(var.file, "rb"))
    for (j in seq_len(num.cells)) {
      var.mat[,j] <- readBin(con, double(), endian = "little", n=num.genes)
    }
    close(con)
    mat <- list(counts=counts.mat, variance=var.mat)
  }
  mat
}

readAlevin <- function(files, dropInfReps, forceSlow) {
  dir <- sub("/alevin$","",dirname(files))
  barcode.file <- file.path(dir, "alevin/quants_mat_rows.txt")
  gene.file <- file.path(dir, "alevin/quants_mat_cols.txt")
  matrix.file <- file.path(dir, "alevin/quants_mat.gz")
  var.file <- file.path(dir, "alevin/quants_var_mat.gz")
  boot.file <- file.path(dir, "alevin/quants_boot_mat.gz")
  for (f in c(barcode.file, gene.file, matrix.file)) {
    if (!file.exists(f)) {
      stop("expecting 'files' to point to 'quants_mat.gz' file in a directory 'alevin'
           also containing 'quants_mat_rows.txt' and 'quants_mat_cols.txt'.
           please re-run alevin preserving output structure")
    }
  }
  cell.names <- readLines(barcode.file)
  gene.names <- readLines(gene.file)

  if (!requireNamespace("jsonlite", quietly=TRUE)) {
    stop("importing alevin requires package `jsonlite`")
  }
  jsonPath <- file.path(dir, "cmd_info.json")
  cmd_info <- jsonlite::fromJSON(jsonPath)
  if ("numCellBootstraps" %in% names(cmd_info)) {
    num.boot <- as.numeric(cmd_info$numCellBootstraps)
  } else {
    num.boot <- 0
  }

  if (!requireNamespace("Matrix", quietly=TRUE)) {
    stop("importing alevin requires package `Matrix`")
  }

  # test for fishpond >= 1.1.18
  hasFishpond <- TRUE
  if (!requireNamespace("fishpond", quietly=TRUE)) {
    hasFishpond <- FALSE
  } else {
    if (packageVersion("fishpond") < "1.1.18") {
      hasFishpond <- FALSE
    }
  }
  if (!hasFishpond) message("importing alevin data is much faster after installing `fishpond` (>= 1.1.18)")

  # for testing purposes, force the use of the slower R code for importing alevin
  if (forceSlow) {
    hasFishpond <- FALSE
  }

  extraMsg <- if (hasFishpond) "with fishpond" else ""
  message(paste("reading in alevin gene-level counts across cells", extraMsg))

  if (hasFishpond) {
    # reads alevin's Efficient Data Storage (EDS) format
    # using C++ code in the fishpond package
    mat <- readAlevinFast(matrix.file, gene.names, cell.names)
  } else {
    # reads alevin EDS format in R, using e.g.
    # `readBin` and `intToBits`
    # slow in R, because requires looping over cells to read positions and expression
    mat <- readAlevinBits(matrix.file, gene.names, cell.names)
  }

  if (num.boot > 0) {
    message(paste("reading in alevin inferential variance", extraMsg))
    var.exists <- file.exists(var.file)
    boot.exists <- file.exists(boot.file)
    stopifnot(var.exists)
    if (hasFishpond) {
      var.mat <- readAlevinFast(var.file, gene.names, cell.names)
    } else {
      var.mat <- readAlevinBits(var.file, gene.names, cell.names)
    }
    if (boot.exists && !dropInfReps) {
      # read in bootstrap inferential replicates
      message("reading in alevin inferential replicates (set dropInfReps=TRUE to skip)")
      infReps <- readAlevinInfReps(boot.file, gene.names, cell.names, num.boot)
      return(list(counts=mat, variance=var.mat, infReps=infReps))
    } else {
      # return counts and variance
      return(list(counts=mat, variance=var.mat))
    }
  } else {
    return(mat)
  }
}

getAlevinVersion <- function(files) {
  if (!requireNamespace("jsonlite", quietly=TRUE)) {
    stop("importing Alevin quantification requires package `jsonlite`")
  }
  fish_dir <- dirname(dirname(files))
  jsonPath <- file.path(fish_dir, "cmd_info.json")
  cmd_info <- jsonlite::fromJSON(jsonPath)
  cmd_info$salmon_version
}

# this is the R version of the reader for alevin's EDS format,
# see below for another function that leverages C++ code from fishpond::readEDS()
readAlevinBits <- function(matrix.file, gene.names, cell.names) {
  num.cells <- length(cell.names)
  num.genes <- length(gene.names)
  len.bit.vec <- ceiling(num.genes/8)

  # for each cell, the bit vector matrix is 8 rows x ceiling(num.genes/8) columns
  # and stores the positions of the non-zero counts;
  # `bits.mat` binds these per-cell matrices together column-wise
  bits.mat <- matrix(nrow=8, ncol=num.cells * len.bit.vec)
  con <- gzcon(file(matrix.file, "rb"))
  for (j in seq_len(num.cells)) {
    # read the bit vectors
    ints <- readBin(con, integer(), size=1, signed=FALSE, endian="little", n=len.bit.vec)
    bits <- matrix(intToBits(ints), nrow=32)
    mode(bits) <- "integer"
    # 8 to 1, because intToBits gives the least sig bit first
    bits <- bits[8:1,]
    num.exp.genes <- sum(bits == 1)
    # store bits in the matrix
    idx <- (j-1) * len.bit.vec + seq_len(len.bit.vec)
    bits.mat[,idx] <- bits
    # read in counts, but don't store
    counts <- readBin(con, double(), size=4, endian="little", n=num.exp.genes)
  }
  close(con)

  con <- gzcon(file(matrix.file, "rb"))
  # stores all of the non-zero counts from all cells concatenated
  counts.vec <- numeric(sum(bits.mat))
  ptr <- 0
  for (j in seq_len(num.cells)) {
    # read in bit vectors, but don't store
    ints <- readBin(con, integer(), size=1, signed=FALSE, endian="little", n=len.bit.vec)
    bit.idx <- (j-1) * len.bit.vec + seq_len(len.bit.vec)
    num.exp.genes <- sum(bits.mat[,bit.idx])
    cts.idx <- ptr + seq_len(num.exp.genes)
    counts.vec[cts.idx] <- readBin(con, double(), size=4, endian="little", n=num.exp.genes)
    ptr <- ptr + num.exp.genes
  }
  close(con)

  gene.idx <- lapply(seq_len(num.cells), function(j) {
    idx <- (j-1) * len.bit.vec + seq_len(len.bit.vec)
    which(head(as.vector(bits.mat[,idx]), num.genes) == 1)
  })
  len.gene.idx <- lengths(gene.idx)
  cell.idx <- rep(seq_along(len.gene.idx), len.gene.idx)

  # build sparse matrix
  mat <- Matrix::sparseMatrix(i=unlist(gene.idx),
                              j=cell.idx,
                              x=counts.vec,
                              dims=c(num.genes, num.cells),
                              dimnames=list(gene.names, cell.names),
                              giveCsparse=FALSE)
  mat
}

# this function performs the same operation as the above R code,
# reading in alevin's EDS format and creating a sparse matrix,
# but it leverages the C++ code in fishpond::readEDS()
readAlevinFast <- function(matrix.file, gene.names, cell.names) {
  num.cells <- length(cell.names)
  num.genes <- length(gene.names)
  mat <- fishpond::readEDS(num.genes, num.cells, matrix.file)
  dimnames(mat) <- list(gene.names, cell.names)
  mat
}

readAlevinInfReps <- function(boot.file, gene.names, cell.names, num.boot) {
  num.cells <- length(cell.names)
  num.genes <- length(gene.names)
  len.bit.vec <- ceiling(num.genes/8)

  # a list to store the bit vector matrices
  # each element of the list is for an inf rep
  # (for description of `bits.mat` see
  # the readAlevinBits() function)
  bits.mat.list <- lapply(seq_len(num.boot), function(i)
    matrix(nrow=8, ncol=num.cells * len.bit.vec))
  con <- gzcon(file(boot.file, "rb"))
  for (j in seq_len(num.cells)) {
    for (i in seq_len(num.boot)) {
      # read the bit vectors
      ints <- readBin(con, integer(), size=1, signed=FALSE, endian="little", n=len.bit.vec)
      bits <- matrix(intToBits(ints), nrow=32)
      mode(bits) <- "integer"
      # 8 to 1, because intToBits gives the least sig bit first
      bits <- bits[8:1,]
      num.exp.genes <- sum(bits == 1)
      # store bits in the matrix
      idx <- (j-1) * len.bit.vec + seq_len(len.bit.vec)
      bits.mat.list[[i]][,idx] <- bits
      # read in counts, but don't store
      counts <- readBin(con, double(), size=4, endian="little", n=num.exp.genes)
    }
  }
  close(con)

  con <- gzcon(file(boot.file, "rb"))
  # as with `bits.mat.list`, `counts.vec.list` is a list
  # over inferential replicates, each one is a counts vector
  # storing all of the non-zero counts from all cells concatenated
  counts.vec.list <- lapply(seq_len(num.boot), function(i)
    numeric(sum(bits.mat.list[[i]])))
  ptr <- numeric(num.boot)
  for (j in seq_len(num.cells)) {
    for (i in seq_len(num.boot)) {
      # read in bit vectors, but don't store
      ints <- readBin(con, integer(), size=1, signed=FALSE, endian="little", n=len.bit.vec)
      bit.idx <- (j-1) * len.bit.vec + seq_len(len.bit.vec)
      num.exp.genes <- sum(bits.mat.list[[i]][,bit.idx])
      cts.idx <- ptr[i] + seq_len(num.exp.genes)
      counts.vec.list[[i]][cts.idx] <- readBin(con, double(), size=4, endian="little", n=num.exp.genes)
      ptr[i] <- ptr[i] + num.exp.genes
    }
  }
  close(con)

  infReps <- lapply(seq_len(num.boot), function(i) {
    gene.idx <- lapply(seq_len(num.cells), function(j) {
      idx <- (j-1) * len.bit.vec + seq_len(len.bit.vec)
      which(head(as.vector(bits.mat.list[[i]][,idx]), num.genes) == 1)
    })
    len.gene.idx <- lengths(gene.idx)
    cell.idx <- rep(seq_along(len.gene.idx), len.gene.idx)
    z <- Matrix::sparseMatrix(i=unlist(gene.idx),
                              j=cell.idx,
                              x=counts.vec.list[[i]],
                              dims=c(num.genes, num.cells),
                              dimnames=list(gene.names, cell.names),
                              giveCsparse=FALSE)
  })
  infReps
}

tximport/R/helper.R
#' Low-level function to make counts from abundance using matrices
#'
#' Simple low-level function used within \link{tximport} to generate
#' \code{scaledTPM} or \code{lengthScaledTPM} counts, taking as input
#' the original counts, abundance and length matrices.
#' NOTE: This is a low-level function exported in case it is needed for some reason,
#' but the recommended way to generate counts-from-abundance is using
#' \link{tximport} with the \code{countsFromAbundance} argument.
#'
#' @param countsMat a matrix of original counts
#' @param abundanceMat a matrix of abundances (typically TPM)
#' @param lengthMat a matrix of effective lengths
#' @param countsFromAbundance the desired type of count-from-abundance output
#'
#' @return a matrix of count-scale data generated from abundances.
#' for details on the calculation see \link{tximport}.
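#'
#' @examples
#'
#' # a minimal sketch, not from the package authors: the numbers below are
#' # made up for illustration (two transcripts in rows, two samples in columns)
#' counts <- matrix(c(10, 30, 20, 60), ncol=2)
#' abundance <- matrix(c(100, 300, 150, 450), ncol=2)
#' len <- matrix(c(1000, 2000, 1000, 2000), ncol=2)
#' cfa <- makeCountsFromAbundance(counts, abundance, len,
#'                                countsFromAbundance="lengthScaledTPM")
#' # the new counts are rescaled per sample, so their column sums
#' # should agree with the column sums of the original counts
#' colSums(cfa)
#' colSums(counts)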
#'
#' @export
makeCountsFromAbundance <- function(countsMat, abundanceMat, lengthMat,
                                    countsFromAbundance=c("scaledTPM","lengthScaledTPM")) {
  countsFromAbundance <- match.arg(countsFromAbundance)
  sparse <- is(countsMat, "dgCMatrix")
  colsumfun <- if (sparse) Matrix::colSums else colSums
  countsSum <- colsumfun(countsMat)
  if (countsFromAbundance == "lengthScaledTPM") {
    newCounts <- abundanceMat * rowMeans(lengthMat)
  } else if (countsFromAbundance == "scaledTPM") {
    newCounts <- abundanceMat
  } else {
    stop("expecting 'lengthScaledTPM' or 'scaledTPM'")
  }
  newSum <- colsumfun(newCounts)
  if (sparse) {
    countsMat <- Matrix::t(Matrix::t(newCounts) * (countsSum/newSum))
  } else {
    countsMat <- t(t(newCounts) * (countsSum/newSum))
  }
  countsMat
}

# function for replacing missing average transcript length values
replaceMissingLength <- function(lengthMat, aveLengthSampGene) {
  nanRows <- which(apply(lengthMat, 1, function(row) any(is.nan(row))))
  if (length(nanRows) > 0) {
    for (i in nanRows) {
      if (all(is.nan(lengthMat[i,]))) {
        # if all samples have 0 abundances for all tx, use the simple average
        lengthMat[i,] <- aveLengthSampGene[i]
      } else {
        # otherwise use the geometric mean of the lengths from the other samples
        idx <- is.nan(lengthMat[i,])
        lengthMat[i,idx] <- exp(mean(log(lengthMat[i,!idx]), na.rm=TRUE))
      }
    }
  }
  lengthMat
}

medianLengthOverIsoform <- function(length, tx2gene, ignoreTxVersion, ignoreAfterBar) {
  txId <- rownames(length)
  if (ignoreTxVersion) {
    txId <- sub("\\..*", "", txId)
  } else if (ignoreAfterBar) {
    txId <- sub("\\|.*", "", txId)
  }
  tx2gene <- cleanTx2Gene(tx2gene)
  stopifnot(all(txId %in% tx2gene$tx))
  tx2gene <- tx2gene[match(txId, tx2gene$tx),]
  # average the lengths
  ave.len <- rowMeans(length)
  # median over isoforms
  med.len <- tapply(ave.len, tx2gene$gene, median)
  one.sample <- med.len[match(tx2gene$gene, names(med.len))]
  matrix(rep(one.sample, ncol(length)), ncol=ncol(length), dimnames=dimnames(length))
}

# code contributed from Andrew Morgan
read_kallisto_h5 <- function(fpath, ...) {
  if (!requireNamespace("rhdf5", quietly=TRUE)) {
    stop("reading kallisto results from hdf5 files requires Bioconductor package `rhdf5`")
  }
  counts <- rhdf5::h5read(fpath, "est_counts")
  ids <- rhdf5::h5read(fpath, "aux/ids")
  efflens <- rhdf5::h5read(fpath, "aux/eff_lengths")

  # as suggested by https://support.bioconductor.org/p/96958/#101090
  ids <- as.character(ids)

  stopifnot(length(counts) == length(ids))
  stopifnot(length(efflens) == length(ids))

  result <- data.frame(target_id = ids,
                       eff_length = efflens,
                       est_counts = counts,
                       stringsAsFactors = FALSE)
  normfac <- with(result, (1e6)/sum(est_counts/eff_length))
  result$tpm <- with(result, normfac*(est_counts/eff_length))
  return(result)
}

summarizeFail <- function() {
  stop("

  tximport failed at summarizing to the gene-level.
  Please see 'Solutions' in the Details section of the man page: ?tximport

")
}

# this is much faster than by(), a bit slower than dplyr summarize_each()
## fastby <- function(m, f, fun) {
##   idx <- split(1:nrow(m), f)
##   if (ncol(m) > 1) {
##     t(sapply(idx, function(i) fun(m[i,,drop=FALSE])))
##   } else {
##     matrix(vapply(idx, function(i) fun(m[i,,drop=FALSE], FUN.VALUE=numeric(ncol(m)))),
##            dimnames=list(levels(f), colnames(m)))
##   }
## }

tximport/R/infReps.R
# from http://stackoverflow.com/questions/25099825/row-wise-variance-of-a-matrix-in-r
rowVars <- function(x) {
  rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}

# Read inferential replicate information from salmon / sailfish
#
# SF ver >= 0.9.0, Salmon ver >= 0.8.0
#
# fish_dir = path to a sailfish output directory
readInfRepFish <- function(fish_dir, meth) {
  # aux_info is the default auxiliary directory in salmon
  # aux is the default directory in sailfish
  aux_dir <- "aux_info"

  if (!requireNamespace("jsonlite", quietly=TRUE)) {
    stop("importing inferential replicates for Salmon or Sailfish requires package `jsonlite`.
  to skip this step, set dropInfReps=TRUE")
  }

  # if the default is overwritten, then use that instead
  jsonPath <- file.path(fish_dir, "cmd_info.json")
  if (!file.exists(jsonPath)) {
    return(NULL)
  }
  cmd_info <- jsonlite::fromJSON(jsonPath)
  if (is.element("auxDir", names(cmd_info))) {
    aux_dir <- cmd_info$auxDir
  }
  auxPath <- file.path(fish_dir, aux_dir)
  if (!file.exists(auxPath)) {
    return(NULL)
  }

  # get all of the meta info
  minfo <- jsonlite::fromJSON(file.path(auxPath, "meta_info.json"))

  if ("salmon_version" %in% names(minfo)) {
    stopifnot(package_version(minfo$salmon_version) >= "0.8.0")
  }
  if ("sailfish_version" %in% names(minfo)) {
    stopifnot(package_version(minfo$sailfish_version) >= "0.9.0")
  }

  sampType <- NULL
  # check if we have explicitly recorded the type of posterior sample
  # (salmon >= 0.7.3)
  if (is.element("samp_type", names(minfo))) {
    sampType <- minfo$samp_type
  }

  # load bootstrap data if it exists
  knownSampleTypes <- c("gibbs", "bootstrap")
  numBoot <- minfo$num_bootstraps
  if (numBoot > 0) {
    bootCon <- gzcon(file(file.path(auxPath, 'bootstrap', 'bootstraps.gz'), "rb"))

    ##
    # Gibbs samples *used* to be integers, and bootstraps were doubles
    # Now, however, both types of samples are doubles. The code below
    # tries to load doubles first, but falls back to integers if it fails.
    ##
    if("num_valid_targets" %in% names(minfo)) {
      minfo$num_targets = minfo$num_valid_targets
    }
    expected.n <- minfo$num_targets * minfo$num_bootstraps
    boots <- tryCatch({
      bootsIn <- readBin(bootCon, "double", n = expected.n)
      stopifnot(length(bootsIn) == expected.n)
      bootsIn
    }, error=function(...) {
      # close and re-open the connection to reset the file;
      # `<<-` updates `bootCon` in the enclosing function, so that the
      # close() call after tryCatch() acts on the re-opened connection
      # rather than on the one that was already closed here
      close(bootCon)
      bootCon <<- gzcon(file(file.path(auxPath, 'bootstrap', 'bootstraps.gz'), "rb"))
      readBin(bootCon, "integer", n = expected.n)
    })
    close(bootCon)

    # rows are transcripts, columns are bootstraps
    dim(boots) <- c(minfo$num_targets, minfo$num_bootstraps)
    vars <- rowVars(boots)
    return(list(vars=vars, reps=boots))
  } else {
    return(NULL)
  }
}

readInfRepKallisto <- function(bear_dir) {
  h5File <- file.path(bear_dir, "abundance.h5")
  if (!file.exists(h5File)) return(NULL)
  if (!requireNamespace("rhdf5", quietly=TRUE)) {
    stop("reading kallisto results from hdf5 files requires Bioconductor package `rhdf5`")
  }
  groups <- rhdf5::h5ls(h5File)
  numBoot <- length(groups$group[groups$group == "/bootstrap"])
  if (numBoot > 0) {
    boots <- rhdf5::h5read(file.path(bear_dir, "abundance.h5"), "/bootstrap")
    numTxp <- length(boots$bs0)
    bootMat <- matrix(nrow=numTxp, ncol=numBoot)
    for (bsn in seq_len(numBoot)) {
      bootMat[,bsn] <- boots[bsn][[1]]
    }
    vars <- rowVars(bootMat)
    return(list(vars=vars, reps=bootMat))
  } else {
    return(NULL)
  }
}

tximport/R/summarizeToGene.R
#' @rdname summarizeToGene
#' @export
setGeneric("summarizeToGene", function(object, ...) standardGeneric("summarizeToGene"))

summarizeToGene.list <- function(object,
                                 tx2gene,
                                 varReduce=FALSE,
                                 ignoreTxVersion=FALSE,
                                 ignoreAfterBar=FALSE,
                                 countsFromAbundance=c("no","scaledTPM","lengthScaledTPM")
                                 ) {

  countsFromAbundance <- match.arg(countsFromAbundance, c("no","scaledTPM","lengthScaledTPM"))

  if (!is.null(object$countsFromAbundance)) {
    if (countsFromAbundance == "no" & object$countsFromAbundance != "no") {
      warning(paste0("the incoming counts have countsFromAbundance = '",
                     object$countsFromAbundance,
                     "', and so the original counts are no longer accessible.
  to use countsFromAbundance='no', re-run tximport() with this setting.
  over-riding 'countsFromAbundance' to set it to: ",
                     object$countsFromAbundance))
      countsFromAbundance <- object$countsFromAbundance
    }
  }

  # unpack matrices from list for cleaner code
  abundanceMatTx <- object$abundance
  countsMatTx <- object$counts
  lengthMatTx <- object$length

  txId <- rownames(abundanceMatTx)
  stopifnot(all(txId == rownames(countsMatTx)))
  stopifnot(all(txId == rownames(lengthMatTx)))

  # need to associate tx to genes
  # potentially remove unassociated transcript rows and warn user
  if (!is.null(tx2gene)) {

    # code to strip dots or bars and all remaining chars from the rownames of matrices
    if (ignoreTxVersion) {
      txId <- sub("\\..*", "", txId)
    } else if (ignoreAfterBar) {
      txId <- sub("\\|.*", "", txId)
    }

    tx2gene <- cleanTx2Gene(tx2gene)

    # if none of the rownames of the matrices (txId) are
    # in the tx2gene table something is wrong
    if (!any(txId %in% tx2gene$tx)) {
      txFromFile <- paste0("Example IDs (file): [", paste(head(txId,3),collapse=", "),", ...]")
      txFromTable <- paste0("Example IDs (tx2gene): [", paste(head(tx2gene$tx,3),collapse=", "),", ...]")
      stop(paste0("
None of the transcripts in the quantification files are present
in the first column of tx2gene.
Check to see that you are using the same annotation for both.\n\n",txFromFile,"\n\n",txFromTable,
                  "\n\n  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.\n\n"))
    }

    # remove transcripts (and genes) not in the rownames of matrices
    tx2gene <- tx2gene[tx2gene$tx %in% txId,]
    tx2gene$gene <- droplevels(tx2gene$gene)
    ntxmissing <- sum(!txId %in% tx2gene$tx)
    if (ntxmissing > 0) message("transcripts missing from tx2gene: ", ntxmissing)

    # subset to transcripts in the tx2gene table
    sub.idx <- txId %in% tx2gene$tx
    abundanceMatTx <- abundanceMatTx[sub.idx,,drop=FALSE]
    countsMatTx <- countsMatTx[sub.idx,,drop=FALSE]
    lengthMatTx <- lengthMatTx[sub.idx,,drop=FALSE]
    txId <- txId[sub.idx]

    # now create a vector of geneId which aligns to the matrices
    geneId <- tx2gene$gene[match(txId, tx2gene$tx)]
  }

  # summarize abundance and counts
  message("summarizing abundance")
  abundanceMat <- rowsum(abundanceMatTx, geneId)
  message("summarizing counts")
  countsMat <- rowsum(countsMatTx, geneId)
  message("summarizing length")

  if ("infReps" %in% names(object)) {
    infReps <- lapply(object$infReps, function(x) rowsum(x[sub.idx,,drop=FALSE], geneId))
    message("summarizing inferential replicates")
  }

  # the next lines calculate a weighted average of transcript length,
  # weighting by transcript abundance.
  # this can be used as an offset / normalization factor which removes length bias
  # for the differential analysis of estimated counts summarized at the gene level.
  weightedLength <- rowsum(abundanceMatTx * lengthMatTx, geneId)
  lengthMat <- weightedLength / abundanceMat

  # pre-calculate a simple average transcript length
  # for the case the abundances are all zero for all samples.
  # first, average the tx lengths over samples
  aveLengthSamp <- rowMeans(lengthMatTx)
  # then simple average of lengths within genes (not weighted by abundance)
  aveLengthSampGene <- tapply(aveLengthSamp, geneId, mean)

  stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat)))

  # check for NaN and if possible replace these values with geometric mean of other samples.
  # (the geometric mean here implies an offset of 0 on the log scale)
  # NaN come from samples which have abundance of 0 for all isoforms of a gene, and
  # so we cannot calculate the weighted average. our best guess is to use the average
  # transcript length from the other samples.
  lengthMat <- replaceMissingLength(lengthMat, aveLengthSampGene)

  if (countsFromAbundance != "no") {
    countsMat <- makeCountsFromAbundance(countsMat, abundanceMat, lengthMat, countsFromAbundance)
  }

  if ("infReps" %in% names(object)) {
    if (varReduce) {
      vars <- sapply(infReps, rowVars)
      out <- list(abundance=abundanceMat, counts=countsMat, variance=vars,
                  length=lengthMat, countsFromAbundance=countsFromAbundance)
    } else {
      out <- list(abundance=abundanceMat, counts=countsMat, infReps=infReps,
                  length=lengthMat, countsFromAbundance=countsFromAbundance)
    }
  } else {
    out <- list(abundance=abundanceMat, counts=countsMat,
                length=lengthMat, countsFromAbundance=countsFromAbundance)
  }
  return(out)
}

#' Summarize estimated quantities to gene-level
#'
#' Summarizes abundances, counts, lengths, (and inferential
#' replicates or variance) from transcript- to gene-level.
#'
#' @param object the list of matrices of transcript-level abundances,
#' counts, lengths produced by \code{\link{tximport}},
#' with a \code{countsFromAbundance} element that tells
#' how the counts were generated.
#' @param tx2gene see \code{\link{tximport}}
#' @param varReduce see \code{\link{tximport}}
#' @param ignoreTxVersion see \code{\link{tximport}}
#' @param ignoreAfterBar see \code{\link{tximport}}
#' @param countsFromAbundance see \code{\link{tximport}}
#' @param ... additional arguments, ignored
#'
#' @return a list of matrices of gene-level abundances, counts, lengths,
#' (and inferential replicates or variance if inferential replicates
#' are present).
#'
#' @rdname summarizeToGene
#' @docType methods
#' @aliases summarizeToGene,list-method
#'
#' @seealso \code{\link{tximport}}
#'
#' @export
setMethod("summarizeToGene", signature(object="list"), summarizeToGene.list)

cleanTx2Gene <- function(tx2gene) {
  colnames(tx2gene) <- c("tx","gene")
  if (any(duplicated(tx2gene$tx))) {
    message("removing duplicated transcript rows from tx2gene")
    tx2gene <- tx2gene[!duplicated(tx2gene$tx),]
  }
  tx2gene$gene <- factor(tx2gene$gene)
  tx2gene$tx <- factor(tx2gene$tx)
  tx2gene
}

tximport/R/tximport.R
#' Import transcript-level abundances and counts
#' for transcript- and gene-level analysis packages
#'
#' \code{tximport} imports transcript-level estimates from various
#' external software and optionally summarizes abundances, counts,
#' and transcript lengths
#' to the gene-level (default) or outputs transcript-level matrices
#' (see \code{txOut} argument).
#'
#' \code{tximport} will also load in information about inferential replicates --
#' a list of matrices of the Gibbs samples from the posterior, or bootstrap replicates,
#' per sample -- if these data are available in the expected locations relative to the \code{files}.
#' The inferential replicates, stored in \code{infReps} in the output list,
#' are on estimated counts, and therefore follow \code{counts} in the output list.
#' By setting \code{varReduce=TRUE}, the inferential replicate matrices
#' will be replaced by a single matrix with the sample variance per transcript/gene and per sample.
#'
#' While \code{tximport} summarizes to the gene-level by default,
#' the user can also perform the import and summarization steps manually,
#' by specifying \code{txOut=TRUE} and then using the function \code{summarizeToGene}.
#' Note however that this is equivalent to \code{tximport} with \code{txOut=FALSE} (the default).
#'
#' Solutions to the error "tximport failed at summarizing to the gene-level":
#'
#' \enumerate{
#'   \item provide a \code{tx2gene} data.frame linking transcripts to genes (more below)
#'   \item avoid gene-level summarization by specifying \code{txOut=TRUE}
#'   \item set \code{geneIdCol} to an appropriate column in the files
#' }
#'
#' See \code{vignette('tximport')} for example code for generating a
#' \code{tx2gene} data.frame from a \code{TxDb} object.
#' Note that the \code{keys} and \code{select} functions used
#' to create the \code{tx2gene} object are documented
#' in the man page for \link[AnnotationDbi]{AnnotationDb-class} objects
#' in the AnnotationDbi package (TxDb inherits from AnnotationDb).
#' For further details on generating TxDb objects from various inputs
#' see \code{vignette('GenomicFeatures')} from the GenomicFeatures package.
#'
#' For \code{type="alevin"} all arguments other than \code{files},
#' \code{dropInfReps}, and \code{forceSlow} are ignored,
#' and \code{files} should point to a single \code{quants_mat.gz} file,
#' in the directory structure created by the alevin software
#' (e.g. do not move the file or delete the other important files).
#' Note that importing alevin quantifications will be much faster by first
#' installing the \code{fishpond} package, which contains a
#' C++ importer for alevin's EDS format.
#' For alevin, \code{tximport} is importing the gene-by-cell matrix of counts,
#' as \code{txi$counts}, and effective lengths are not estimated.
#' \code{txi$variance} may also be imported if inferential replicates were
#' used, as well as inferential replicates if these were output by alevin.
#' Length correction should not be applied to datasets where there
#' is not an expected correlation of counts and feature length.
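#'
#' As a minimal sketch of an alevin import (the directory \code{alevin_out}
#' is a hypothetical path, standing in for the output directory written
#' by alevin, and is not part of the package):
#'
#' \preformatted{
#' files <- file.path("alevin_out", "alevin", "quants_mat.gz")
#' txi <- tximport(files, type="alevin", dropInfReps=TRUE)
#' dim(txi$counts) # genes x cells
#' }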
#'
#' @param files a character vector of filenames for the transcript-level abundances
#' @param type character, the type of software used to generate the abundances.
#' Options are "salmon", "sailfish", "alevin", "kallisto", "rsem", "stringtie", or "none".
#' This argument is used to autofill the arguments below (geneIdCol, etc.).
#' "none" means that the user will specify these columns.
#' @param txIn logical, whether the incoming files are transcript level (default TRUE)
#' @param txOut logical, whether the function should just output
#' transcript-level (default FALSE)
#' @param countsFromAbundance character, either "no" (default), "scaledTPM",
#' "lengthScaledTPM", or "dtuScaledTPM".
#' Whether to generate estimated counts using abundance estimates:
#' \itemize{
#'   \item scaled up to library size (scaledTPM),
#'   \item scaled using the average transcript length over samples
#'         and then the library size (lengthScaledTPM), or
#'   \item scaled using the median transcript length among isoforms of a gene,
#'         and then the library size (dtuScaledTPM).
#' }
#' dtuScaledTPM is designed for DTU analysis in combination with \code{txOut=TRUE},
#' and it requires specifying a \code{tx2gene} data.frame.
#' dtuScaledTPM works such that within a gene, values from all samples and
#' all transcripts get scaled by the same fixed median transcript length.
#' If using scaledTPM, lengthScaledTPM, or dtuScaledTPM,
#' the counts are no longer correlated across samples with transcript length,
#' and so the length offset matrix should not be used.
#' @param tx2gene a two-column data.frame linking transcript id (column 1)
#' to gene id (column 2).
#' the column names are not relevant, but this column order must be used.
#' this argument is required for gene-level summarization for methods
#' that provide transcript-level estimates only
#' (kallisto, Salmon, Sailfish)
#' @param varReduce whether to reduce per-sample inferential replicates
#' information into a matrix of sample variances \code{variance} (default FALSE).
#' alevin computes inferential variance by default for bootstrap
#' inferential replicates, so this argument is ignored/not necessary
#' @param dropInfReps whether to skip reading in inferential replicates
#' (default FALSE). For alevin, \code{tximport} will still read in the
#' inferential variance matrix if it exists
#' @param infRepStat a function to re-compute counts and abundances from the
#' inferential replicates, e.g. \code{matrixStats::rowMedians} to re-compute counts
#' as the median of the inferential replicates. The order of operations is:
#' first counts are re-computed, then abundances are re-computed.
#' Following this, if \code{countsFromAbundance} is not "no",
#' \code{tximport} will again re-compute counts from the re-computed abundances.
#' \code{infRepStat} should operate on rows of a matrix. (default is NULL)
#' @param ignoreTxVersion logical, whether to split the tx id on the '.' character
#' to remove version information to facilitate matching with the tx id in \code{tx2gene}
#' (default FALSE)
#' @param ignoreAfterBar logical, whether to split the tx id on the '|' character
#' to facilitate matching with the tx id in \code{tx2gene} (default FALSE)
#' @param geneIdCol name of column with gene id. if missing, the \code{tx2gene} argument can be used
#' @param txIdCol name of column with tx id
#' @param abundanceCol name of column with abundances (e.g. TPM or FPKM)
#' @param countsCol name of column with estimated counts
#' @param lengthCol name of column with feature length information
#' @param importer a function used to read in the files
#' @param existenceOptional logical, should tximport not check if files exist before attempting
#' import (default FALSE, meaning files must exist according to \code{file.exists})
#' @param sparse logical, whether to try to import data sparsely (default is FALSE).
#' Initial implementation for \code{txOut=TRUE}, \code{countsFromAbundance="no"}
#' or \code{"scaledTPM"}, no inferential replicates. Only counts matrix
#' is returned (and abundance matrix if using \code{"scaledTPM"})
#' @param sparseThreshold the minimum threshold for including a count as a
#' non-zero count during sparse import (default is 1)
#' @param readLength numeric, the read length used to calculate counts from
#' StringTie's output of coverage. Default value (from StringTie) is 75.
#' The formula used to calculate counts is:
#' \code{cov * transcript length / read length}
#' @param forceSlow logical, argument used for testing. Will force the use of
#' the slower R code for importing alevin, even if \code{fishpond}
#' library is installed. Default is FALSE
#'
#' @return a simple list containing matrices: abundance, counts, length.
#' Another list element 'countsFromAbundance' carries through
#' the character argument used in the tximport call.
#' The length matrix contains the average transcript length for each
#' gene which can be used as an offset for gene-level analysis.
#' If detected, and \code{txOut=TRUE}, inferential replicates for
#' each sample will be imported and stored as a list of matrices,
#' itself an element \code{infReps} in the returned list.
#' An exception is alevin, in which the \code{infReps} are a list
#' of bootstrap replicate matrices, where each matrix has
#' genes as rows and cells as columns.
#' If \code{varReduce=TRUE} the inferential replicates will be summarized
#' according to the sample variance, and stored as a matrix \code{variance}.
#' alevin already computes the variance of the bootstrap inferential replicates
#' and so this is imported without needing to specify \code{varReduce=TRUE}
#' (note that alevin uses the 1/N variance estimator, so not the same as \code{var}).
#'
#' @references
#'
#' Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015):
#' Differential analyses for RNA-seq: transcript-level estimates
#' improve gene-level inferences. F1000Research.
#' \url{http://dx.doi.org/10.12688/f1000research.7563.1}
#'
#' @examples
#'
#' # load data for demonstrating tximport
#' # note that the vignette shows more examples
#' # including how to read in files quickly using the readr package
#'
#' library(tximportData)
#' dir <- system.file("extdata", package="tximportData")
#' samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
#' files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
#' names(files) <- paste0("sample",1:6)
#'
#' # tx2gene links transcript IDs to gene IDs for summarization
#' tx2gene <- read.csv(file.path(dir, "tx2gene.gencode.v27.csv"))
#'
#' txi <- tximport(files, type="salmon", tx2gene=tx2gene)
#'
#' @importFrom utils read.delim capture.output head compareVersion packageVersion
#' @importFrom stats median
#' @importFrom methods is
#'
#' @export
tximport <- function(files,
                     type=c("none","salmon","sailfish","alevin","kallisto","rsem","stringtie"),
                     txIn=TRUE,
                     txOut=FALSE,
                     countsFromAbundance=c("no","scaledTPM","lengthScaledTPM","dtuScaledTPM"),
                     tx2gene=NULL,
                     varReduce=FALSE,
                     dropInfReps=FALSE,
                     infRepStat=NULL,
                     ignoreTxVersion=FALSE,
                     ignoreAfterBar=FALSE,
                     geneIdCol,
                     txIdCol,
                     abundanceCol,
                     countsCol,
                     lengthCol,
                     importer=NULL,
                     existenceOptional=FALSE,
                     sparse=FALSE,
                     sparseThreshold=1,
                     readLength=75,
                     forceSlow=FALSE) {

  # inferential replicate importer
  infRepImporter <- NULL

  type <- match.arg(type)
  countsFromAbundance <- match.arg(countsFromAbundance)
  if (countsFromAbundance == "dtuScaledTPM") {
    stopifnot(txOut)
    if (is.null(tx2gene)) stop("'dtuScaledTPM' requires 'tx2gene' input")
  }

  if (!existenceOptional) stopifnot(all(file.exists(files)))
  if (!txIn & txOut) stop("txOut only an option when transcript-level data is read in (txIn=TRUE)")

  stopifnot(length(files) > 0)
  kallisto.h5 <- basename(files[1]) == "abundance.h5"
  if (type == "kallisto" & !kallisto.h5) {
    message("Note: importing `abundance.h5` is typically faster than `abundance.tsv`")
  }

  if (type=="rsem" & txIn & grepl("genes", files[1])) {
    message("It looks like you are importing RSEM genes.results files, setting txIn=FALSE")
    txIn <- FALSE
  }

  # special alevin code
  if (type=="alevin") {
    if (length(files) > 1) stop("alevin import currently only supports a single experiment")
    vrsn <- getAlevinVersion(files)
    compareToV014 <- compareVersion(vrsn, "0.14.0")
    if (compareToV014 == -1) {
      mat <- readAlevinPreV014(files)
    } else {
      mat <- readAlevin(files, dropInfReps, forceSlow)
    }
    if (!is.list(mat)) {
      # only counts
      txi <- list(abundance=NULL, counts=mat, length=NULL, countsFromAbundance="no")
    } else {
      if ("infReps" %in% names(mat)) {
        # counts + variance + infReps
        txi <- list(abundance=NULL, counts=mat$counts, variance=mat$variance,
                    infReps=mat$infReps, length=NULL, countsFromAbundance="no")
      } else {
        # counts + variance
        txi <- list(abundance=NULL, counts=mat$counts, variance=mat$variance,
                    length=NULL, countsFromAbundance="no")
      }
    }
    return(txi)
  }

  readrStatus <- FALSE
  if (is.null(importer) & !kallisto.h5) {
    if (!requireNamespace("readr", quietly=TRUE)) {
      message("reading in files with read.delim (install 'readr' package for speed up)")
      importer <- read.delim
    } else {
      message("reading in files with read_tsv")
      readrStatus <- TRUE
    }
  }

  # salmon/sailfish presets
  if (type %in% c("salmon","sailfish")) {
    txIdCol <- "Name"
    abundanceCol <- "TPM"
    countsCol <- "NumReads"
    lengthCol <- "EffectiveLength"
    if (readrStatus &
        is.null(importer)) {
      col.types <- readr::cols(
        readr::col_character(),readr::col_integer(),readr::col_double(),readr::col_double(),readr::col_double()
      )
      importer <- function(x) readr::read_tsv(x, progress=FALSE, col_types=col.types)
    }
    infRepImporter <- if (dropInfReps) { NULL } else { function(x) readInfRepFish(x, type) }
  }

  # kallisto presets
  if (type == "kallisto") {
    txIdCol <- "target_id"
    abundanceCol <- "tpm"
    countsCol <- "est_counts"
    lengthCol <- "eff_length"
    if (kallisto.h5) {
      importer <- read_kallisto_h5
    } else if (readrStatus & is.null(importer)) {
      col.types <- readr::cols(
        readr::col_character(),readr::col_integer(),readr::col_double(),readr::col_double(),readr::col_double()
      )
      importer <- function(x) readr::read_tsv(x, progress=FALSE, col_types=col.types)
    }
    infRepImporter <- if (dropInfReps) { NULL } else { readInfRepKallisto }
  }

  # rsem presets
  if (type == "rsem") {
    if (txIn) {
      txIdCol <- "transcript_id"
      abundanceCol <- "TPM"
      countsCol <- "expected_count"
      lengthCol <- "effective_length"
      if (readrStatus & is.null(importer)) {
        col.types <- readr::cols(
          readr::col_character(),readr::col_character(),readr::col_integer(),readr::col_double(),
          readr::col_double(),readr::col_double(),readr::col_double(),readr::col_double()
        )
        importer <- function(x) readr::read_tsv(x, progress=FALSE, col_types=col.types)
      }
    } else {
      geneIdCol <- "gene_id"
      abundanceCol <- "TPM"
      countsCol <- "expected_count"
      lengthCol <- "effective_length"
      if (readrStatus & is.null(importer)) {
        col.types <- readr::cols(
          readr::col_character(),readr::col_character(),readr::col_double(),readr::col_double(),
          readr::col_double(),readr::col_double(),readr::col_double()
        )
        importer <- function(x) readr::read_tsv(x, progress=FALSE, col_types=col.types)
      }
    }
  }

  if (type == c("stringtie")) {
    txIdCol <- "t_name"
    geneIdCol <- "gene_name"
    abundanceCol <- "FPKM"
    countsCol <- "cov"
    lengthCol <- "length"
    if (readrStatus & is.null(importer)) {
      col.types <- readr::cols(
        readr::col_character(),readr::col_character(),readr::col_character(),readr::col_integer(),readr::col_integer(),readr::col_character(),readr::col_integer(),readr::col_integer(),readr::col_character(),readr::col_character(),readr::col_double(),readr::col_double()
      )
      importer <- function(x) readr::read_tsv(x, progress=FALSE, col_types=col.types)
    }
  }

  infRepType <- "none"
  if (type %in% c("salmon", "sailfish", "kallisto") & !dropInfReps) {
    # if summarizing to gene-level, need the full matrices passed to summarizeToGene
    infRepType <- if (varReduce & txOut) { "var" } else { "full" }
  }

  if (dropInfReps) stopifnot(is.null(infRepStat))

  # special code for RSEM gene.results files.
  # RSEM gene-level is the only case of !txIn
  if (!txIn) {
    txi <- computeRsemGeneLevel(files, importer, geneIdCol, abundanceCol, countsCol, lengthCol, countsFromAbundance)
    return(txi)
  }

  # if external tx2gene table not provided, send user to vignette
  if (is.null(tx2gene) & !txOut) {
    summarizeFail() # ...long message in helper.R
  }

  # trial run of inferential replicate info
  repInfo <- NULL
  if (infRepType != "none") {
    repInfo <- infRepImporter(dirname(files[1]))
    # if we didn't find inferential replicate info
    if (is.null(repInfo)) {
      infRepType <- "none"
    }
  }

  if (sparse) {
    if (!requireNamespace("Matrix", quietly=TRUE)) {
      stop("sparse import requires core R package `Matrix`")
    }
    message("importing sparsely, only counts and abundances returned, support limited to txOut=TRUE, CFA either 'no' or 'scaledTPM', and no inferential replicates")
    stopifnot(txOut)
    stopifnot(infRepType == "none")
    stopifnot(countsFromAbundance %in% c("no","scaledTPM"))
  }

  ######################################################
  # the rest of the code assumes transcript-level input:

  ### --- BEGIN --- loop over files reading in columns / inf reps ###
  for (i in seq_along(files)) {
    message(i," ",appendLF=FALSE)

    # import and convert quantification info to data.frame
    raw <- as.data.frame(importer(files[i]))

    # import inferential replicate info
    if (infRepType != "none") {
      repInfo <- infRepImporter(dirname(files[i]))
    } else {
      repInfo <- NULL
    }

    # check for columns
    stopifnot(all(c(abundanceCol, countsCol, lengthCol) %in% names(raw)))

    # check for same-across-samples
    if (i == 1) {
      txId <- raw[[txIdCol]]
    } else {
      stopifnot(all(txId == raw[[txIdCol]]))
    }

    # if importing dense matrices
    if (!sparse) {
      # create empty matrices
      if (i == 1) {
        mat <- matrix(nrow=nrow(raw),ncol=length(files))
        rownames(mat) <- raw[[txIdCol]]
        colnames(mat) <- names(files)
        abundanceMatTx <- mat
        countsMatTx <- mat
        lengthMatTx <- mat
        if (infRepType == "var") {
          varMatTx <- mat
        } else if (infRepType == "full") {
          infRepMatTx <- list()
        }
      }
      abundanceMatTx[,i] <- raw[[abundanceCol]]
      countsMatTx[,i] <- raw[[countsCol]]
      lengthMatTx[,i] <- raw[[lengthCol]]
      if (infRepType == "var") {
        varMatTx[,i] <- repInfo$vars
      } else if (infRepType == "full") {
        infRepMatTx[[i]] <- repInfo$reps
      }
      # if infRepStat was specified, re-compute counts and abundances
      if (!is.null(infRepStat)) {
        countsMatTx[,i] <- infRepStat(repInfo$reps)
        tpm <- countsMatTx[,i] / lengthMatTx[,i]
        abundanceMatTx[,i] <- tpm * 1e6 / sum(tpm)
      }
    } else {
      # try importing sparsely
      if (i == 1) {
        txId <- raw[[txIdCol]]
        countsListI <- list()
        countsListX <- list()
        abundanceListX <- list()
        numNonzero <- c()
      }
      stopifnot(all(txId == raw[[txIdCol]]))
      sparse.idx <- which(raw[[countsCol]] >= sparseThreshold)
      countsListI <- c(countsListI, sparse.idx)
      countsListX <- c(countsListX, raw[[countsCol]][sparse.idx])
      numNonzero <- c(numNonzero, length(sparse.idx))
      if (countsFromAbundance == "scaledTPM") {
        abundanceListX <- c(abundanceListX, raw[[abundanceCol]][sparse.idx])
      }
    }
  }
  ### --- END --- loop over files ###

  # compile sparse matrices
  if (sparse) {
    countsMatTx <- Matrix::sparseMatrix(i=unlist(countsListI),
                                        j=rep(seq_along(numNonzero), numNonzero),
                                        x=unlist(countsListX),
                                        dims=c(length(txId),length(files)),
                                        dimnames=list(txId, names(files)))
    if (countsFromAbundance == "scaledTPM") {
      abundanceMatTx <-
        Matrix::sparseMatrix(i=unlist(countsListI),
                             j=rep(seq_along(numNonzero), numNonzero),
                             x=unlist(abundanceListX),
                             # pass dims explicitly, as for the counts matrix above,
                             # so the matrix shape always matches the dimnames
                             dims=c(length(txId),length(files)),
                             dimnames=list(txId, names(files)))
    } else {
      abundanceMatTx <- NULL
    }
    lengthMatTx <- NULL
  }

  # propagate names to inferential replicate list
  if (infRepType == "full") {
    names(infRepMatTx) <- names(files)
  }

  message("")

  # if there is no information about inferential replicates
  if (infRepType == "none") {
    txi <- list(abundance=abundanceMatTx,
                counts=countsMatTx,
                length=lengthMatTx,
                countsFromAbundance=countsFromAbundance)
  } else if (infRepType == "var") {
    # if we're keeping only the variance from inferential replicates
    txi <- list(abundance=abundanceMatTx,
                counts=countsMatTx,
                variance=varMatTx,
                length=lengthMatTx,
                countsFromAbundance=countsFromAbundance)
  } else if (infRepType == "full") {
    # if we're keeping the full samples from inferential replicates
    txi <- list(abundance=abundanceMatTx,
                counts=countsMatTx,
                infReps=infRepMatTx,
                length=lengthMatTx,
                countsFromAbundance=countsFromAbundance)
  }

  # stringtie outputs coverage, here we turn into counts
  if (type == "stringtie") {
    # here "counts" is still just coverage, this formula gives back original counts
    txi$counts <- txi$counts * txi$length / readLength
  }

  if (type == "rsem") {
    # protect against 0 bp length transcripts
    txi$length[txi$length < 1] <- 1
  }

  # two main outputs, based on choice of txOut:

  # 1) if the user requested just the transcript-level data, return it now
  if (txOut) {
    # if countsFromAbundance in {scaledTPM, lengthScaledTPM, or dtuScaledTPM}
    if (countsFromAbundance != "no") {
      # for dtuScaledTPM, pretend we're doing lengthScaledTPM w/ an altered length matrix.
      # note that we will still output txi$countsFromAbundance set to "dtuScaledTPM"
      length4CFA <- txi$length # intermediate version of the length matrix
      if (countsFromAbundance == "dtuScaledTPM") {
        length4CFA <- medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion, ignoreAfterBar)
        countsFromAbundance <- "lengthScaledTPM"
      }
      # function for computing all 3 countsFromAbundance methods:
      txi$counts <- makeCountsFromAbundance(countsMat=txi$counts,
                                            abundanceMat=txi$abundance,
                                            lengthMat=length4CFA,
                                            countsFromAbundance=countsFromAbundance)
    }
    return(txi)
  }

  # 2) otherwise, summarize to the gene-level
  txi[["countsFromAbundance"]] <- NULL
  txiGene <- summarizeToGene(txi, tx2gene, varReduce, ignoreTxVersion, ignoreAfterBar, countsFromAbundance)
  return(txiGene)
}

# split out this special code for RSEM with gene-level input
# (all other input is transcript-level)
computeRsemGeneLevel <- function(files, importer, geneIdCol, abundanceCol, countsCol, lengthCol, countsFromAbundance) {
  # RSEM already has gene-level summaries
  # so we just combine the gene-level summaries across files
  if (countsFromAbundance != "no") {
    warning("countsFromAbundance other than 'no' requires transcript-level estimates")
  }
  for (i in seq_along(files)) {
    message(i," ",appendLF=FALSE)
    out <- capture.output({
      raw <- as.data.frame(importer(files[i]))
    }, type="message")
    stopifnot(all(c(geneIdCol, abundanceCol, lengthCol) %in% names(raw)))
    if (i == 1) {
      mat <- matrix(nrow=nrow(raw),ncol=length(files))
      rownames(mat) <- raw[[geneIdCol]]
      colnames(mat) <- names(files)
      abundanceMat <- mat
      countsMat <- mat
      lengthMat <- mat
    }
    abundanceMat[,i] <- raw[[abundanceCol]]
    countsMat[,i] <- raw[[countsCol]]
    lengthMat[,i] <- raw[[lengthCol]]
  }
  txi <- list(abundance=abundanceMat, counts=countsMat, length=lengthMat,
              countsFromAbundance="no")
  return(txi)
}

tximport/README.md

# tximport

Import and summarize transcript-level estimates for transcript- and
gene-level analysis

The methods and analysis are described in:

* Charlotte Soneson, Michael I. Love, Mark D. Robinson. [Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences](http://f1000research.com/articles/4-1521), *F1000Research*, 4:1521, December 2015. doi: 10.12688/f1000research.7563.1

---

Imports transcript-level abundance, estimated counts and transcript
lengths, and summarizes into matrices for use with downstream
statistical analysis packages such as edgeR, DESeq2, limma-voom.
Average transcript length, weighted by sample-specific transcript
abundance estimates, is provided as a matrix which can be used as an
offset for differential expression of gene-level counts.

See examples in the [vignette](http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html).

Notes:

* tximport as of version 1.3.9 will import inferential replicates
  (Gibbs samples or bootstrap samples) from Salmon, Sailfish or kallisto.

* Though we provide here functionality for performing gene-level
  differential expression using summarized transcript-level estimates,
  this does not mean we suggest that users *only* perform gene-level
  analysis. Gene-level differential expression can be complemented with
  transcript- or exon-level analysis. The argument `txOut=TRUE` can be
  used to generate transcript-level matrices.

  %\VignetteIndexEntry{Importing transcript abundance datasets with tximport}
  %\VignetteEngine{knitr::rmarkdown}
---

## Introduction

Import and summarize transcript-level abundance estimates for
transcript- and gene-level analysis with Bioconductor packages, such
as *edgeR*, *DESeq2*, and *limma-voom*.
The motivation and methods for the functions provided by the *tximport*
package are described in the following article [@Soneson2015]:

> Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015):
> Differential analyses for RNA-seq: transcript-level estimates
> improve gene-level inferences. *F1000Research*
> http://dx.doi.org/10.12688/f1000research.7563.1

In particular, the *tximport* pipeline offers the following benefits:
(i) this approach corrects for potential changes in gene length across
samples (e.g. from differential isoform usage)
[@Trapnell2013Differential],
(ii) some of the upstream quantification methods (*Salmon*, *Sailfish*,
*kallisto*) are substantially faster and require less memory and disk
usage compared to alignment-based methods that require creation and
storage of BAM files, and
(iii) it is possible to avoid discarding those fragments that can align
to multiple genes with homologous sequence, thus increasing sensitivity
[@Robert2015Errors].

**Note:** another Bioconductor package,
[tximeta](https://bioconductor.org/packages/tximeta) [@Love2019],
extends *tximport*, offering the same functionality, plus the
additional benefit of automatic addition of annotation metadata for
commonly used transcriptomes (GENCODE, Ensembl, RefSeq for human and
mouse). See the [tximeta](https://bioconductor.org/packages/tximeta)
package vignette for more details. Whereas `tximport` outputs a simple
list of matrices, `tximeta` will output a *SummarizedExperiment* object
with appropriate *GRanges* added if the transcriptome is from one of
the sources above for human and mouse.

```{r, echo=FALSE}
library(knitr)
opts_chunk$set(tidy=TRUE,message=FALSE)
```

## Import transcript-level estimates

We begin by locating some prepared files that contain transcript
abundance estimates for six samples, from the *tximportData* package.
The *tximport* pipeline will be nearly identical for various
quantification tools, usually requiring only a change to the `type`
argument.
We begin with quantification files generated by the *Salmon* software,
and later show the use of *tximport* with any of:

* *Salmon* [@Patro2017Salmon]
* *Alevin* [@Srivastava2019]
* *Sailfish* [@Patro2014Sailfish]
* *kallisto* [@Bray2016Near]
* *RSEM* [@Li2011RSEM]
* *StringTie* [@Pertea2015]

First, we locate the directory containing the files. (Here we use
`system.file` to locate the package directory, but for a typical use,
we would just provide a path, e.g. `"/path/to/dir"`.)

```{r}
library(tximportData)
dir <- system.file("extdata", package="tximportData")
list.files(dir)
```

Next, we create a named vector pointing to the quantification
files. We will create a vector of filenames first by reading in a table
that contains the sample IDs, and then combining this with `dir` and
`"quant.sf.gz"`. (We gzipped the quantification files to make the data
package smaller; this is not a problem for the R functions that we use
to import the files.)

```{r}
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
files <- file.path(dir, "salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
all(file.exists(files))
```

Transcripts need to be associated with gene IDs for gene-level
summarization. If that information is present in the files, we can
skip this step. For Salmon, Sailfish, and kallisto the files only
provide the transcript ID. We first make a data.frame called `tx2gene`
with two columns: 1) transcript ID and 2) gene ID. The column names do
not matter but this column order must be used. The transcript ID must
be the same one used in the abundance files.

Creating this `tx2gene` data.frame can be accomplished from a *TxDb*
object and the `select` function from the *AnnotationDbi* package.
The following code could be used to construct such a table:

```{r}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype="TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
```

Note: if you are using an *Ensembl* transcriptome, the easiest way to
create the `tx2gene` data.frame is to use the
[ensembldb](http://bioconductor.org/packages/ensembldb) package. The
annotation packages can be found by version number, and use the pattern
`EnsDb.Hsapiens.vXX`. The `transcripts` function can be used with
`return.type="DataFrame"`, in order to obtain something like the
`tx2gene` object constructed in the code chunk above. See the
*ensembldb* package vignette for more details.

In this case, we've used the Gencode v27 CHR transcripts to build our
index, and we used `makeTxDbFromGFF` and code similar to the chunk
above to build the `tx2gene` table. We then read in a pre-constructed
`tx2gene` table:

```{r}
library(readr)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
head(tx2gene)
```

The *tximport* package has a single function for importing
transcript-level estimates. The `type` argument is used to specify
what software was used for estimation ("salmon", "sailfish", "alevin",
"kallisto", "rsem", and "stringtie" are implemented). A simple list
with matrices, "abundance", "counts", and "length", is returned, where
the transcript level information is summarized to the gene-level. The
"length" matrix can be used to generate an offset matrix for
downstream gene-level differential analysis of count matrices, as
shown below.

**Note**: While *tximport* works without any dependencies, it is
significantly faster to read in files using the *readr* package. If
*tximport* detects that *readr* is installed, then it will use the
`readr::read_tsv` function by default. A change from version 1.2 to
1.4 is that the reader is not specified by the user anymore, but
chosen automatically based on the availability of the *readr* package.
Advanced users can still customize the import of files using the
`importer` argument.

```{r}
library(tximport)
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
names(txi)
head(txi$counts)
```

We could alternatively generate counts from abundances using the
argument `countsFromAbundance`: either scaled to library size
(`"scaledTPM"`), or additionally scaled using the average transcript
length over samples and then to library size (`"lengthScaledTPM"`).
Using either of these approaches, the counts are not correlated with
length, and so the length matrix should not be provided as an offset
for downstream analysis packages. As of *tximport* version 1.10, we
have added a new `countsFromAbundance` option `"dtuScaledTPM"`. This
scaling option is designed for use with `txOut=TRUE` for differential
transcript usage analyses. See `?tximport` for details on the various
`countsFromAbundance` options.

We can avoid gene-level summarization by setting `txOut=TRUE`, giving
the original transcript level estimates as a list of matrices.

```{r}
txi.tx <- tximport(files, type="salmon", txOut=TRUE)
```

These matrices can then be summarized afterwards using the function
`summarizeToGene`. This gives the same list of matrices as using
`txOut=FALSE` (the default) in the first `tximport` call.

```{r}
txi.sum <- summarizeToGene(txi.tx, tx2gene)
all.equal(txi$counts, txi.sum$counts)
```

## Salmon

Salmon or Sailfish `quant.sf` files can be imported by setting type to
`"salmon"` or `"sailfish"`.

```{r}
files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi.salmon <- tximport(files, type="salmon", tx2gene=tx2gene)
head(txi.salmon$counts)
```

We quantified with Sailfish against a different transcriptome, so we
need to read in a different `tx2gene` for this next code chunk.
```{r}
tx2knownGene <- read_csv(file.path(dir, "tx2gene.csv"))
files <- file.path(dir,"sailfish", samples$run, "quant.sf")
names(files) <- paste0("sample",1:6)
txi.sailfish <- tximport(files, type="sailfish", tx2gene=tx2knownGene)
head(txi.sailfish$counts)
```

*Note*: for previous versions of Salmon or Sailfish, in which the
`quant.sf` files start with comment lines, it is recommended to
specify the `importer` argument as a function which reads in the lines
beginning with the header. For example, using the following code chunk
(un-evaluated):

```{r eval=FALSE}
txi <- tximport("quant.sf", type="none", txOut=TRUE,
                txIdCol="Name", abundanceCol="TPM",
                countsCol="NumReads", lengthCol="Length",
                importer=function(x) read_tsv(x, skip=8))
```

## Salmon with inferential replicates

If inferential replicates (Gibbs or bootstrap samples) are present in
expected locations relative to the `quant.sf` file, *tximport* will
import these as well. With `txOut=TRUE`, as in the chunk below, they
are returned at the transcript level; gene-level summarization of
inferential replicates has been supported since version 1.9.9 (see the
package NEWS). Here we demonstrate using Salmon, run with only 5 Gibbs
replicates (usually more Gibbs samples would be useful for estimating
variability).

```{r}
files <- file.path(dir,"salmon_gibbs", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi.inf.rep <- tximport(files, type="salmon", txOut=TRUE)
names(txi.inf.rep)
names(txi.inf.rep$infReps)
dim(txi.inf.rep$infReps$sample1)
```

The *tximport* arguments `varReduce` and `dropInfReps` can be used to
summarize the inferential replicates into a single variance per
transcript and per sample, or to not import inferential replicates,
respectively.

## kallisto

kallisto `abundance.h5` files can be imported by setting type to
`"kallisto"`. Note that this requires that you have the Bioconductor
package [rhdf5](http://bioconductor.org/packages/rhdf5) installed.
(Here we only demonstrate reading in transcript-level information.)
```{r}
files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5")
names(files) <- paste0("sample",1:6)
txi.kallisto <- tximport(files, type="kallisto", txOut=TRUE)
head(txi.kallisto$counts)
```

## kallisto with inferential replicates

Because the `kallisto_boot` directory also has inferential replicate
information, it was imported as well. With `txOut=TRUE`, as here, the
replicates are kept at the transcript level. As with Salmon, gene-level
summarization of the replicates has been supported since version 1.9.9
(see the package NEWS).

```{r}
names(txi.kallisto)
names(txi.kallisto$infReps)
dim(txi.kallisto$infReps$sample1)
```

## kallisto with TSV files

kallisto `abundance.tsv` files can be imported as well, but this is
typically slower than the approach above. Note that we add an
additional argument in this code chunk, `ignoreAfterBar=TRUE`. This is
because the Gencode transcripts have names like
"ENST00000456328.2|ENSG00000223972.5|...", though our `tx2gene` table
only includes the first "ENST" identifier. We therefore want to split
the incoming quantification matrix rownames at the first bar "|", and
only use this as an identifier. We didn't use this option earlier with
Salmon, because we used the argument `--gencode` when running Salmon,
which itself does the splitting upstream of `tximport`. Note that
`ignoreTxVersion` and `ignoreAfterBar` are only used to facilitate the
summarization to gene level.

```{r}
files <- file.path(dir, "kallisto", samples$run, "abundance.tsv.gz")
names(files) <- paste0("sample",1:6)
txi.kallisto.tsv <- tximport(files, type="kallisto", tx2gene=tx2gene, ignoreAfterBar=TRUE)
head(txi.kallisto.tsv$counts)
```

## RSEM

RSEM `sample.genes.results` files can be imported by setting type to
`"rsem"`, and `txIn` and `txOut` to `FALSE`.
```{r}
files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".genes.results.gz"))
names(files) <- paste0("sample",1:6)
txi.rsem <- tximport(files, type="rsem", txIn=FALSE, txOut=FALSE)
head(txi.rsem$counts)
```

RSEM `sample.isoforms.results` files can be imported by setting type to `"rsem"`, and `txIn` and `txOut` to `TRUE`.

```{r}
files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".isoforms.results.gz"))
names(files) <- paste0("sample",1:6)
txi.rsem <- tximport(files, type="rsem", txIn=TRUE, txOut=TRUE)
head(txi.rsem$counts)
```

## StringTie

StringTie `t_data.ctab` files giving the coverage and abundances for transcripts can be imported by setting type to `"stringtie"`. These files can be generated with the following command line call:

```
stringtie -eB -G transcripts.gff
```

*tximport* will compute counts from the coverage information, by reversing the formula that StringTie uses to calculate coverage (see `?tximport`). The read length is used in this formula, and so if you've set a different read length when using StringTie, you can provide this information with the `readLength` argument. The `tx2gene` table should connect transcripts to genes, and can be pulled out of one of the `t_data.ctab` files. The tximport call would look like the following (here not evaluated):

```{r, eval=FALSE}
tmp <- read_tsv(files[1])
tx2gene <- tmp[,c("t_name","gene_name")]
txi <- tximport(files, type="stringtie", tx2gene=tx2gene)
```

## Alevin

scRNA-seq data quantified with *Alevin* can be easily imported using *tximport*. The following unevaluated example shows import of the quants matrix (for a live example, see the unit test file `test_alevin.R`). A single file should be specified which will import a gene-by-cell matrix of data.

```{r, eval=FALSE}
files <- "path/to/alevin/quants_mat.gz"
txi <- tximport(files, type="alevin")
```

## Downstream DGE in Bioconductor

**Note**: there are two suggested ways of importing estimates for use with differential gene expression (DGE) methods. The first method, which we show below for *edgeR* and for *DESeq2*, is to use the gene-level estimated counts from the quantification tools, and additionally to use the transcript-level abundance estimates to calculate a gene-level offset that corrects for changes to the average transcript length across samples. The code examples below accomplish these steps for you, keeping track of appropriate matrices and calculating these offsets. For *edgeR* you need to assign a matrix to `y$offset`, but the function *DESeqDataSetFromTximport* takes care of creation of the offset for you. Let's call this method "*original counts and offset*".

The second method is to use the `tximport` argument `countsFromAbundance="lengthScaledTPM"` or `"scaledTPM"`, and then to use the gene-level count matrix `txi$counts` directly as you would a regular count matrix with these packages. Let's call this method "*bias corrected counts without an offset*".

**Note:** Do not manually pass the original gene-level counts to downstream methods *without an offset*. The only case where this would make sense is if there is no length bias to the counts, as happens in 3' tagged RNA-seq data (see section below). The original gene-level counts are in `txi$counts` when `tximport` was run with `countsFromAbundance="no"`. This is simply passing the summed estimated transcript counts, and does not correct for potential differential isoform usage (the offset), which is the point of the *tximport* methods [@Soneson2015] for gene-level analysis. Passing uncorrected gene-level counts without an offset is not recommended by the *tximport* package authors.

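To make the idea of "*bias corrected counts without an offset*" more concrete, here is a minimal, unevaluated sketch of the scaling on toy matrices in base R. It is not the package's code: the matrices `abund`, `len` and `cts` and the helper `scale_to_libsize` are hypothetical names invented for this illustration, and the sketch only assumes the behaviour described in `?tximport` (abundances, optionally multiplied by the average length over samples, then scaled up to each sample's library size). In practice, use `tximport` with the `countsFromAbundance` argument.

```{r, eval=FALSE}
# Toy data: 3 features x 2 samples (made-up values, for illustration only)
abund <- matrix(c(10, 20, 70,
                  30, 30, 40), ncol=2,
                dimnames=list(paste0("tx",1:3), c("s1","s2")))
len <- matrix(c(1000, 2000, 500,
                1200, 1800, 500), ncol=2, dimnames=dimnames(abund))
cts <- matrix(c(100, 400, 350,
                360, 540, 200), ncol=2, dimnames=dimnames(abund))

# Hypothetical helper: rescale each column of 'x' so that it sums to the
# library size (column sum) of the original counts
scale_to_libsize <- function(x, cts) {
  sweep(x, 2, colSums(cts) / colSums(x), "*")
}

# "scaledTPM": abundances scaled up to library size
scaledTPM <- scale_to_libsize(abund, cts)

# "lengthScaledTPM": abundances multiplied by the average length over
# samples, then scaled up to library size
lengthScaledTPM <- scale_to_libsize(abund * rowMeans(len), cts)

# Both matrices keep the per-sample totals of the original counts
colSums(cts)
colSums(scaledTPM)
colSums(lengthScaledTPM)
```

Because the length information is already folded into these count-scale values, they are used directly as a count matrix, and no separate length offset is supplied downstream.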
The two methods we provide here are: "*original counts and offset*" or "*bias corrected counts without an offset*". Passing `txi` to `DESeqDataSetFromTximport` as outlined below is correct: the function creates the appropriate offset for you to perform gene-level differential expression.

## 3' tagged RNA-seq

If you have 3' tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, we recommend using the original counts, e.g. providing `txi$counts` as a counts matrix to *DESeqDataSetFromMatrix* or to the *edgeR* or *limma* functions, without calculating an offset and without using *countsFromAbundance*.

## edgeR

An example of creating a `DGEList` for use with *edgeR* [@Robinson2010]:

```{r, results="hide", message=FALSE}
library(edgeR)
library(csaw)
```

```{r}
cts <- txi$counts
normMat <- txi$length

# Obtaining per-observation scaling factors for length,
# adjusted to avoid changing the magnitude of the counts.
normMat <- normMat / exp(rowMeans(log(normMat)))
normCts <- cts / normMat

# Computing effective library sizes from scaled counts,
# to account for composition biases between samples.
library(edgeR)
eff.lib <- calcNormFactors(normCts) * colSums(normCts)

# Combining effective library sizes with the length factors,
# and calculating offsets for a log-link GLM.
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)

# Creating a DGEList object for use in edgeR.
y <- DGEList(cts)
y <- scaleOffset(y, normMat)
# filtering
keep <- filterByExpr(y)
y <- y[keep,]
# y is now ready for the dispersion estimation functions
# see edgeR User's Guide
```

For creating a matrix of CPMs within *edgeR*, the following code chunk can be used:

```{r}
se <- SummarizedExperiment(assays=list(counts=y$counts, offset=y$offset))
se$totals <- y$samples$lib.size
library(csaw)
cpms <- calculateCPM(se, use.offsets=TRUE, log=FALSE)
```

## DESeq2

An example of creating a `DESeqDataSet` for use with *DESeq2* [@Love2014]:

```{r, results="hide", message=FALSE}
library(DESeq2)
```

The user should make sure the rownames of `sampleTable` align with the colnames of `txi$counts`, if there are colnames. The best practice is to read `sampleTable` from a CSV file, and to construct `files` from a column of `sampleTable`, as was shown in the *tximport* examples above.

```{r}
sampleTable <- data.frame(condition=factor(rep(c("A","B"),each=3)))
rownames(sampleTable) <- colnames(txi$counts)
```

```{r}
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~ condition)
# dds is now ready for DESeq()
# see DESeq2 vignette
```

## limma-voom

An example of creating a data object for use with *limma-voom* [@Law2014].
Because limma-voom does not use the offset matrix stored in `y$offset`, we recommend using the scaled counts generated from abundances, either `"scaledTPM"` or `"lengthScaledTPM"`:

```{r}
files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi <- tximport(files, type="salmon", tx2gene=tx2gene,
                countsFromAbundance="lengthScaledTPM")
library(limma)
y <- DGEList(txi$counts)
# filtering
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
design <- model.matrix(~ condition, data=sampleTable)
v <- voom(y, design)
# v is now ready for lmFit()
# see limma User's Guide
```

## Acknowledgments

The development of *tximport* has benefited from contributions and suggestions from:

[Rob Patro](https://twitter.com/nomad421) (inferential replicates import),
[Andrew Parker Morgan](https://github.com/andrewparkermorgan) (RHDF5 support),
[Ryan C. Thompson](https://github.com/DarwinAwardWinner) (RHDF5 support),
[Matt Shirley](https://twitter.com/mdshw5) (ignoreTxVersion),
[Avi Srivastava](https://twitter.com/k3yavi) (`alevin` import),
[Stephen Turner](https://twitter.com/genetics_blog),
[Richard Smith-Unna](https://twitter.com/blahah404),
[Rory Kirchner](https://twitter.com/RoryKirchner),
[Martin Morgan](https://twitter.com/mt_morgan),
Jenny Drnevich,
[Patrick Kimes](https://twitter.com/pkkimes),
[Leon Fodoulian](https://twitter.com/LFodoulian),
[Koen Van den Berge](https://twitter.com/koenvdberge_Be),
[Aaron Lun](https://github.com/LTLA)

## Session info

```{r}
sessionInfo()
```

## References

tximport/man/0000755000175400017540000000000013556120525014245 5ustar00biocbuildbiocbuildtximport/man/makeCountsFromAbundance.Rd0000644000175400017540000000215013556120525021270 0ustar00biocbuildbiocbuild% Generated by roxygen2: do not edit by hand % Please edit documentation in R/helper.R \name{makeCountsFromAbundance} \alias{makeCountsFromAbundance} \title{Low-level function to make counts from abundance using matrices} \usage{ 
makeCountsFromAbundance(countsMat, abundanceMat, lengthMat, countsFromAbundance = c("scaledTPM", "lengthScaledTPM")) } \arguments{ \item{countsMat}{a matrix of original counts} \item{abundanceMat}{a matrix of abundances (typically TPM)} \item{lengthMat}{a matrix of effective lengths} \item{countsFromAbundance}{the desired type of count-from-abundance output} } \value{ a matrix of count-scale data generated from abundances. For details on the calculation see \link{tximport}. } \description{ Simple low-level function used within \link{tximport} to generate \code{scaledTPM} or \code{lengthScaledTPM} counts, taking as input the original counts, abundance and length matrices. NOTE: This is a low-level function exported in case it is needed for some reason, but the recommended way to generate counts-from-abundance is using \link{tximport} with the \code{countsFromAbundance} argument. } tximport/man/summarizeToGene.Rd0000644000175400017540000000242513556120525017655 0ustar00biocbuildbiocbuild% Generated by roxygen2: do not edit by hand % Please edit documentation in R/summarizeToGene.R \docType{methods} \name{summarizeToGene} \alias{summarizeToGene} \alias{summarizeToGene,list-method} \title{Summarize estimated quantities to gene-level} \usage{ summarizeToGene(object, ...) 
\S4method{summarizeToGene}{list}(object, tx2gene, varReduce = FALSE, ignoreTxVersion = FALSE, ignoreAfterBar = FALSE, countsFromAbundance = c("no", "scaledTPM", "lengthScaledTPM")) } \arguments{ \item{object}{the list of matrices of transcript-level abundances, counts, lengths produced by \code{\link{tximport}}, with a \code{countsFromAbundance} element that tells how the counts were generated.} \item{...}{additional arguments, ignored} \item{tx2gene}{see \code{\link{tximport}}} \item{varReduce}{see \code{\link{tximport}}} \item{ignoreTxVersion}{see \code{\link{tximport}}} \item{ignoreAfterBar}{see \code{\link{tximport}}} \item{countsFromAbundance}{see \code{\link{tximport}}} } \value{ a list of matrices of gene-level abundances, counts, lengths, (and inferential replicates or variance if inferential replicates are present). } \description{ Summarizes abundances, counts, lengths, (and inferential replicates or variance) from transcript- to gene-level. } \seealso{ \code{\link{tximport}} } tximport/man/tximport.Rd0000644000175400017540000002342113556120525016424 0ustar00biocbuildbiocbuild% Generated by roxygen2: do not edit by hand % Please edit documentation in R/tximport.R \name{tximport} \alias{tximport} \title{Import transcript-level abundances and counts for transcript- and gene-level analysis packages} \usage{ tximport(files, type = c("none", "salmon", "sailfish", "alevin", "kallisto", "rsem", "stringtie"), txIn = TRUE, txOut = FALSE, countsFromAbundance = c("no", "scaledTPM", "lengthScaledTPM", "dtuScaledTPM"), tx2gene = NULL, varReduce = FALSE, dropInfReps = FALSE, infRepStat = NULL, ignoreTxVersion = FALSE, ignoreAfterBar = FALSE, geneIdCol, txIdCol, abundanceCol, countsCol, lengthCol, importer = NULL, existenceOptional = FALSE, sparse = FALSE, sparseThreshold = 1, readLength = 75, forceSlow = FALSE) } \arguments{ \item{files}{a character vector of filenames for the transcript-level abundances} \item{type}{character, the type of software used to generate the 
abundances. Options are "salmon", "sailfish", "alevin", "kallisto", "rsem", "stringtie", or "none". This argument is used to autofill the arguments below (geneIdCol, etc.). "none" means that the user will specify these columns.} \item{txIn}{logical, whether the incoming files are transcript level (default TRUE)} \item{txOut}{logical, whether the function should just output transcript-level (default FALSE)} \item{countsFromAbundance}{character, either "no" (default), "scaledTPM", "lengthScaledTPM", or "dtuScaledTPM". Whether to generate estimated counts using abundance estimates: \itemize{ \item scaled up to library size (scaledTPM), \item scaled using the average transcript length over samples and then the library size (lengthScaledTPM), or \item scaled using the median transcript length among isoforms of a gene, and then the library size (dtuScaledTPM). } dtuScaledTPM is designed for DTU analysis in combination with \code{txOut=TRUE}, and it requires specifying a \code{tx2gene} data.frame. dtuScaledTPM works such that within a gene, values from all samples and all transcripts get scaled by the same fixed median transcript length. If using scaledTPM, lengthScaledTPM, or dtuScaledTPM, the counts are no longer correlated across samples with transcript length, and so the length offset matrix should not be used.} \item{tx2gene}{a two-column data.frame linking transcript id (column 1) to gene id (column 2). The column names are not relevant, but this column order must be used. This argument is required for gene-level summarization for methods that provide transcript-level estimates only (kallisto, Salmon, Sailfish)} \item{varReduce}{whether to reduce per-sample inferential replicates information into a matrix of sample variances \code{variance} (default FALSE). 
alevin computes inferential variance by default for bootstrap inferential replicates, so this argument is ignored/not necessary} \item{dropInfReps}{whether to skip reading in inferential replicates (default FALSE). For alevin, \code{tximport} will still read in the inferential variance matrix if it exists} \item{infRepStat}{a function to re-compute counts and abundances from the inferential replicates, e.g. \code{matrixStats::rowMedians} to re-compute counts as the median of the inferential replicates. The order of operations is: first counts are re-computed, then abundances are re-computed. Following this, if \code{countsFromAbundance} is not "no", \code{tximport} will again re-compute counts from the re-computed abundances. \code{infRepStat} should operate on rows of a matrix. (default is NULL)} \item{ignoreTxVersion}{logical, whether to split the tx id on the '.' character to remove version information to facilitate matching with the tx id in \code{tx2gene} (default FALSE)} \item{ignoreAfterBar}{logical, whether to split the tx id on the '|' character to facilitate matching with the tx id in \code{tx2gene} (default FALSE)} \item{geneIdCol}{name of column with gene id. if missing, the \code{tx2gene} argument can be used} \item{txIdCol}{name of column with tx id} \item{abundanceCol}{name of column with abundances (e.g. TPM or FPKM)} \item{countsCol}{name of column with estimated counts} \item{lengthCol}{name of column with feature length information} \item{importer}{a function used to read in the files} \item{existenceOptional}{logical, should tximport not check if files exist before attempting import (default FALSE, meaning files must exist according to \code{file.exists})} \item{sparse}{logical, whether to try to import data sparsely (default is FALSE). Initial implementation for \code{txOut=TRUE}, \code{countsFromAbundance="no"} or \code{"scaledTPM"}, no inferential replicates. 
Only counts matrix is returned (and abundance matrix if using \code{"scaledTPM"})} \item{sparseThreshold}{the minimum threshold for including a count as a non-zero count during sparse import (default is 1)} \item{readLength}{numeric, the read length used to calculate counts from StringTie's output of coverage. Default value (from StringTie) is 75. The formula used to calculate counts is: \code{cov * transcript length / read length}} \item{forceSlow}{logical, argument used for testing. Will force the use of the slower R code for importing alevin, even if \code{fishpond} library is installed. Default is FALSE} } \value{ a simple list containing matrices: abundance, counts, length. Another list element 'countsFromAbundance' carries through the character argument used in the tximport call. The length matrix contains the average transcript length for each gene which can be used as an offset for gene-level analysis. If detected, and \code{txOut=TRUE}, inferential replicates for each sample will be imported and stored as a list of matrices, itself an element \code{infReps} in the returned list. An exception is alevin, in which the \code{infReps} are a list of bootstrap replicate matrices, where each matrix has genes as rows and cells as columns. If \code{varReduce=TRUE} the inferential replicates will be summarized according to the sample variance, and stored as a matrix \code{variance}. alevin already computes the variance of the bootstrap inferential replicates and so this is imported without needing to specify \code{varReduce=TRUE} (note that alevin uses the 1/N variance estimator, so not the same as \code{var}). } \description{ \code{tximport} imports transcript-level estimates from various external software and optionally summarizes abundances, counts, and transcript lengths to the gene-level (default) or outputs transcript-level matrices (see \code{txOut} argument). 
} \details{ \code{tximport} will also load in information about inferential replicates -- a list of matrices of the Gibbs samples from the posterior, or bootstrap replicates, per sample -- if these data are available in the expected locations relative to the \code{files}. The inferential replicates, stored in \code{infReps} in the output list, are on estimated counts, and therefore follow \code{counts} in the output list. By setting \code{varReduce=TRUE}, the inferential replicate matrices will be replaced by a single matrix with the sample variance per transcript/gene and per sample. While \code{tximport} summarizes to the gene-level by default, the user can also perform the import and summarization steps manually, by specifying \code{txOut=TRUE} and then using the function \code{summarizeToGene}. Note however that this is equivalent to \code{tximport} with \code{txOut=FALSE} (the default). Solutions to the error "tximport failed at summarizing to the gene-level": \enumerate{ \item provide a \code{tx2gene} data.frame linking transcripts to genes (more below) \item avoid gene-level summarization by specifying \code{txOut=TRUE} \item set \code{geneIdCol} to an appropriate column in the files } See \code{vignette('tximport')} for example code for generating a \code{tx2gene} data.frame from a \code{TxDb} object. Note that the \code{keys} and \code{select} functions used to create the \code{tx2gene} object are documented in the man page for \link[AnnotationDbi]{AnnotationDb-class} objects in the AnnotationDbi package (TxDb inherits from AnnotationDb). For further details on generating TxDb objects from various inputs see \code{vignette('GenomicFeatures')} from the GenomicFeatures package. For \code{type="alevin"} all arguments other than \code{files}, \code{dropInfReps}, and \code{forceSlow} are ignored, and \code{files} should point to a single \code{quants_mat.gz} file, in the directory structure created by the alevin software (e.g. 
do not move the file or delete the other important files). Note that importing alevin quantifications will be much faster by first installing the \code{fishpond} package, which contains a C++ importer for alevin's EDS format. For alevin, \code{tximport} is importing the gene-by-cell matrix of counts, as \code{txi$counts}, and effective lengths are not estimated. \code{txi$variance} may also be imported if inferential replicates were used, as well as inferential replicates if these were output by alevin. Length correction should not be applied to datasets where there is not an expected correlation of counts and feature length. } \examples{ # load data for demonstrating tximport # note that the vignette shows more examples # including how to read in files quickly using the readr package library(tximportData) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon", samples$run, "quant.sf.gz") names(files) <- paste0("sample",1:6) # tx2gene links transcript IDs to gene IDs for summarization tx2gene <- read.csv(file.path(dir, "tx2gene.gencode.v27.csv")) txi <- tximport(files, type="salmon", tx2gene=tx2gene) } \references{ Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015): Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research. 
\url{http://dx.doi.org/10.12688/f1000research.7563.1} } tximport/tests/0000755000175400017540000000000013556120525014634 5ustar00biocbuildbiocbuildtximport/tests/testthat/0000755000175400017540000000000013556120525016474 5ustar00biocbuildbiocbuildtximport/tests/testthat.R0000644000175400017540000000014013556120525016612 0ustar00biocbuildbiocbuildlibrary(testthat) library(tximport) library(tximportData) library(readr) test_check("tximport") tximport/tests/testthat/test_alevin.R0000644000175400017540000000270613556120525021141 0ustar00biocbuildbiocbuildcontext("alevin") test_that("import alevin works", { dir <- system.file("extdata", package="tximportData") files <- file.path(dir,"alevin/neurons_900_v012/alevin/quants_mat.gz") file.exists(files) txi <- tximport(files, type="alevin") dir <- system.file("extdata", package="tximportData") files <- file.path(dir,"alevin/neurons_900_v014/alevin/quants_mat.gz") file.exists(files) #txi <- tximport(files, type="alevin") #n <- 100 #infrep.var <- apply(abind::abind(lapply(txi$infReps, function(x) as.matrix(x[1:n,1:n])), along=3), 1:2, var) #alevin.var <- as.matrix(txi$variance[1:n,1:n]) #all.equal(alevin.var * (20/19), infrep.var, tolerance=1e-6) #n <- 200 #infrep.mu <- apply(abind::abind(lapply(txi$infReps, function(x) as.matrix(x[1:n,1:n])), along=3), 1:2, mean) #plot(txi$counts[1:n,1:n], infrep.mu) txi <- tximport(files, type="alevin", dropInfReps=TRUE) idx <- 1:1000 # Bioc Windows machine can't handle the entire matrix cts <- unname(as.matrix(txi$counts[idx,])) # compare to MM import matrix.file <- file.path(dir,"alevin/neurons_900_v014/alevin/quants_mat.mtx.gz") mat <- Matrix::readMM(matrix.file) mat <- t(as.matrix(mat[,idx])) expect_true(max(abs(mat - unname(cts))) < 1e-6) # again import alevin without fishpond txi <- tximport(files, type="alevin", dropInfReps=TRUE, forceSlow=TRUE) idx <- 1:1000 cts <- unname(as.matrix(txi$counts[idx,])) expect_true(max(abs(mat - unname(cts))) < 1e-6) }) 
tximport/tests/testthat/test_counts_from_abundance.R0000644000175400017540000000472613556120525024225 0ustar00biocbuildbiocbuildcontext("counts_from_abundance") test_that("getting counts from abundance works", { library(readr) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon", samples$run, "quant.sf.gz") names(files) <- paste0("sample",1:6) tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) txi <- tximport(files, type="salmon", tx2gene=tx2gene) txi.S <- tximport(files, type="salmon", tx2gene=tx2gene, countsFromAbundance="scaledTPM") txi.LS <- tximport(files, type="salmon", tx2gene=tx2gene, countsFromAbundance="lengthScaledTPM") expect_true(ncol(txi.S$counts) == length(files)) expect_true(ncol(txi.LS$counts) == length(files)) # also txOut=TRUE txi.tx <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="no") txi.tx.S <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="scaledTPM") txi.tx.LS <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="lengthScaledTPM") # these should not be exactly the same # lengthScaledTPM is very close, but adjusted for bias expect_true(!all(txi$counts[,1] == txi.S$counts[,1])) expect_true(!all(txi$counts[,1] == txi.LS$counts[,1])) # what if someone sumToGene() with CFA="no" after it was non-no expect_warning({ txi.sum.S <- summarizeToGene(txi.tx.S, tx2gene=tx2gene, countsFromAbundance="no") }, "incoming counts") expect_true(txi.sum.S$countsFromAbundance == "scaledTPM") expect_warning({ txi.sum.LS <- summarizeToGene(txi.tx.LS, tx2gene=tx2gene, countsFromAbundance="no") }, "incoming counts") expect_true(txi.sum.LS$countsFromAbundance == "lengthScaledTPM") # dtuScaledTPM txi.tx.dtu <- tximport(files, type="salmon", tx2gene=tx2gene, txOut=TRUE, countsFromAbundance="dtuScaledTPM") ## cors <- sapply(seq_len(nrow(txi.tx.S$counts)), function(i) { ## x <- txi.tx.LS$counts[i,] ## y <- 
txi.tx.dtu$counts[i,] ## if (all(x==0) | all(y==0)) NA else cor(x,y) ## }) ## hist(cors[cors > .98], col="grey") # errors for these: expect_error(tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="dtuScaledTPM")) expect_error(tximport(files, type="salmon", tx2gene=tx2gene, txOut=FALSE, countsFromAbundance="dtuScaledTPM")) }) tximport/tests/testthat/test_h5.R0000644000175400017540000000121613556120525020172 0ustar00biocbuildbiocbuildcontext("h5") test_that("kallisto HDF5 import works", { library(readr) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"kallisto_boot", samples$run, "abundance.h5") names(files) <- paste0("sample",1:6) txi <- tximport(files, type="kallisto", txOut=TRUE) expect_true("infReps" %in% names(txi)) txi <- tximport(files, type="kallisto", txOut=TRUE, varReduce=TRUE) expect_true("variance" %in% names(txi)) txi <- tximport(files, type="kallisto", txOut=TRUE, dropInfReps=TRUE) expect_true(!any(c("infReps","variance") %in% names(txi))) }) tximport/tests/testthat/test_inf_reps.R0000644000175400017540000000305213556120525021463 0ustar00biocbuildbiocbuildcontext("inf reps") test_that("inferential replicate code works", { library(readr) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon_gibbs", samples$run, "quant.sf.gz") names(files) <- paste0("sample",1:6) txi <- tximport(files, type="salmon", txOut=TRUE) expect_true("infReps" %in% names(txi)) txi <- tximport(files, type="salmon", txOut=TRUE, varReduce=TRUE) expect_true("variance" %in% names(txi)) txi <- tximport(files, type="salmon", txOut=TRUE, dropInfReps=TRUE) expect_true(!any(c("infReps","variance") %in% names(txi))) # test inf replicates w/ summarization # (15098 txps are missing from GTF, this is Ensembl's fault, not tximport's) tx2gene <- read_csv(file.path(dir, "tx2gene.ensembl.v87.csv")) 
txi <- tximport(files, type="salmon", tx2gene=tx2gene, ignoreTxVersion=TRUE) expect_true(grepl("ENSG", rownames(txi$infReps[[1]])[1])) txi <- tximport(files, type="salmon", tx2gene=tx2gene, varReduce=TRUE, ignoreTxVersion=TRUE) expect_true("variance" %in% names(txi)) txi <- tximport(files, type="salmon", tx2gene=tx2gene, dropInfReps=TRUE, ignoreTxVersion=TRUE) expect_true(!any(c("infReps","variance") %in% names(txi))) # test re-computing counts and abundances from inf replicates library(matrixStats) txi <- tximport(files, type="salmon", txOut=TRUE, infRepStat=rowMedians) txp <- which(rownames(txi$counts) == "ENST00000628356.2") expect_equal(txi$counts[txp,1], median(txi$infReps[[1]][txp,])) }) tximport/tests/testthat/test_kallisto.R0000644000175400017540000000077513556120525021511 0ustar00biocbuildbiocbuildcontext("kallisto") test_that("import kallisto works", { library(readr) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"kallisto", samples$run, "abundance.tsv.gz") names(files) <- paste0("sample",1:6) tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) txi <- tximport(files, type="kallisto", tx2gene=tx2gene, ignoreAfterBar=TRUE) expect_true(ncol(txi$counts) == length(files)) }) tximport/tests/testthat/test_no_tx2gene.R0000644000175400017540000000055013556120525021726 0ustar00biocbuildbiocbuildcontext("no_tx2gene") test_that("no tx2gene provided throws error", { dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon", samples$run, "quant.sf.gz") names(files) <- paste0("sample",1:6) expect_error(tximport(files, type="salmon", txOut=FALSE)) }) tximport/tests/testthat/test_one_sample.R0000644000175400017540000000073613556120525022006 0ustar00biocbuildbiocbuildcontext("one_sample") test_that("import one sample works", { library(readr) dir <- system.file("extdata", 
package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon", samples$run, "quant.sf.gz") names(files) <- paste0("sample",1:6) tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) txi <- tximport(files[1], type="salmon", tx2gene=tx2gene) expect_true(ncol(txi$counts) == 1) }) tximport/tests/testthat/test_rsem.R0000644000175400017540000000114213556120525020622 0ustar00biocbuildbiocbuildcontext("rsem") test_that("import rsem works", { library(readr) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".genes.results.gz")) names(files) <- paste0("sample",1:6) expect_message(txi.rsem <- tximport(files, type="rsem", txOut=FALSE), "looks like you") files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".isoforms.results.gz")) names(files) <- paste0("sample",1:6) txi.rsem <- tximport(files, type="rsem", txIn=TRUE, txOut=TRUE) }) tximport/tests/testthat/test_salmon.R0000644000175400017540000000206513556120525021152 0ustar00biocbuildbiocbuildcontext("salmon") test_that("import salmon works", { library(readr) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon", samples$run, "quant.sf.gz") names(files) <- paste0("sample",1:6) tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) txi <- tximport(files, type="salmon", tx2gene=tx2gene) expect_true(ncol(txi$counts) == length(files)) # also test txOut here txi.txout <- tximport(files, type="salmon", txOut=TRUE) expect_true(ncol(txi.txout$counts) == length(files)) # test error for txOut and not txIn expect_error(tximport(files, type="salmon", txIn=FALSE, txOut=TRUE)) # test ignore tx version tx2gene$TXNAME <- sub("\\..*","",tx2gene$TXNAME) tx2gene$GENEID <- sub("\\..*","",tx2gene$GENEID) txi.ign.ver <- tximport(files, 
type="salmon", tx2gene=tx2gene, ignoreTxVersion=TRUE) # test wrong tx2gene tx2gene.bad <- data.frame(letters,letters) expect_error(tximport(files, type="salmon", tx2gene=tx2gene.bad)) }) tximport/tests/testthat/test_sparse.R0000644000175400017540000000171513556120525021157 0ustar00biocbuildbiocbuildcontext("sparse") test_that("importing sparsely works", { library(readr) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon", samples$run, "quant.sf.gz") names(files) <- paste0("sample",1:6) tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) txi0 <- tximport(files, type="salmon", txOut=TRUE) txi <- tximport(files, type="salmon", txOut=TRUE, sparse=TRUE) idx <- txi0$counts[,1] >= 1 expect_equal(txi0$counts[idx,1], txi$counts[idx,1]) txi.cfa0 <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="scaledTPM") txi.cfa <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="scaledTPM", sparse=TRUE) idx <- txi0$counts[,1] >= 1 # test for equality with some tolerance (not exactly equal bc of thresholding for counts < 1) expect_equal(txi.cfa0$counts[idx,1], txi.cfa$counts[idx,1], tolerance=.1) }) tximport/tests/testthat/test_stringtie.R0000644000175400017540000000175013556120525021671 0ustar00biocbuildbiocbuildcontext("stringtie") test_that("import stringtie works", { # these files created with the command: # stringtie -eB -G chess1.0.gff makeData <- function(n) { data.frame(t_id = 1:n, chr = rep("chr1",n), strand = rep("+",n), start = 1:n * 1e4, end = 1:n * 1e4 + 1000, t_name = 1:n, num_exons = rep(10,n), length = 1:n * 1e3, gene_id = rep(1:10,each=n/10), gene_name = rep(letters[1:10],each=n/10), cov = rpois(n, 100), FPKM = rnorm(n, 100, 10)) } n <- 30 A <- makeData(n) B <- makeData(n) C <- makeData(n) files <- c(A="A", B="B", C="C") importer <- function(x) get(x) tx2gene <- A[,c("t_name","gene_name")] txi <- tximport(files, type="stringtie", 
                  tx2gene=tx2gene, importer=importer, existenceOptional=TRUE)

  txi$counts[1,1]
  sum(A$cov[1:3] * A$length[1:3] / 75)

})

tximport/vignettes/
tximport/vignettes/library.bib
@article{Li2011RSEM,
  author = {Li, Bo and Dewey, Colin N.},
  doi = {10.1186/1471-2105-12-323},
  journal = {BMC Bioinformatics},
  pages = {323+},
  title = {{RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.}},
  url = {http://dx.doi.org/10.1186/1471-2105-12-323},
  volume = {12},
  year = {2011}
}

@article{Patro2014Sailfish,
  author = {Patro, Rob and Mount, Stephen M. and Kingsford, Carl},
  journal = {Nature Biotechnology},
  pages = {462--464},
  title = {{Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms}},
  url = {http://dx.doi.org/10.1038/nbt.2862},
  volume = {32},
  year = {2014}
}

@article{Patro2017Salmon,
  author = {Patro, Rob and Duggal, Geet and Love, Michael I. and Irizarry, Rafael A.
    and Kingsford, Carl},
  journal = {Nature Methods},
  title = {Salmon provides fast and bias-aware quantification of transcript expression},
  url = {http://dx.doi.org/10.1038/nmeth.4197},
  year = 2017
}

@article{Bray2016Near,
  author = {Bray, Nicolas and Pimentel, Harold and Melsted, Pall and Pachter, Lior},
  journal = {Nature Biotechnology},
  pages = {525--527},
  title = {Near-optimal probabilistic RNA-seq quantification},
  volume = {34},
  url = {http://dx.doi.org/10.1038/nbt.3519},
  year = 2016
}

@article{Robert2015Errors,
  author = {Robert, Christelle and Watson, Mick},
  doi = {10.1186/s13059-015-0734-x},
  journal = {Genome Biology},
  title = {{Errors in RNA-Seq quantification affect genes of relevance to human disease}},
  url = {http://dx.doi.org/10.1186/s13059-015-0734-x},
  year = {2015}
}

@article{Trapnell2013Differential,
  author = {Trapnell, Cole and Hendrickson, David G and Sauvageau, Martin and Goff, Loyal and Rinn, John L and Pachter, Lior},
  doi = {10.1038/nbt.2450},
  journal = {Nature Biotechnology},
  title = {{Differential analysis of gene regulation at transcript resolution with RNA-seq}},
  url = {http://dx.doi.org/10.1038/nbt.2450},
  year = {2013}
}

@article{Soneson2015,
  url = {http://dx.doi.org/10.12688/f1000research.7563.1},
  author = {Soneson, Charlotte and Love, Michael I. and Robinson, Mark},
  title = {{Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences}},
  journal = {F1000Research},
  year = 2015,
  Volume = 4,
  Issue = 1521
}

@article{Love2014,
  url = {http://dx.doi.org/10.1186/s13059-014-0550-8},
  author = {Love, Michael I. and Huber, Wolfgang and Anders, Simon},
  title = {{Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2}},
  journal = {Genome Biology},
  year = 2014,
  Volume = 15,
  Issue = 12,
  Pages = 550
}

@article{Robinson2010,
  url = {http://dx.doi.org/10.1093/bioinformatics/btp616},
  author = {Robinson, Mark D. and McCarthy, Davis J.
    and Smyth, Gordon K.},
  title = {{edgeR: a Bioconductor package for differential expression analysis of digital gene expression data}},
  journal = {Bioinformatics},
  year = 2010,
  Volume = 26,
  Issue = 1,
  Pages = 139
}

@article{Law2014,
  url = {http://dx.doi.org/10.1186/gb-2014-15-2-r29},
  author = {Law, Charity W. and Chen, Yunshun and Shi, Wei and Smyth, Gordon K.},
  title = {{voom: precision weights unlock linear model analysis tools for RNA-seq read counts}},
  journal = {Genome Biology},
  year = 2014,
  Volume = 15,
  Issue = 2,
  Pages = 29
}

@article{Pertea2015,
  url = {https://dx.doi.org/10.1038%2Fnbt.3122},
  author = {Pertea, Mihaela and Pertea, Geo M and Antonescu, Corina M and Chang, Tsung-Cheng and Mendell, Joshua T and Salzberg, Steven L},
  title = {{StringTie enables improved reconstruction of a transcriptome from RNA-seq reads}},
  journal = {Nature Biotechnology},
  year = 2015,
  Volume = 33,
  Issue = 3,
  Pages = {290--295}
}

@article{Love2019,
  url = {https://doi.org/10.1101/777888},
  author = {Love, Michael I. and Soneson, Charlotte and Hickey, Peter F. and Johnson, Lisa K. and Pierce, N. Tessa and Shepherd, Lori and Morgan, Martin and Patro, Rob},
  title = {{Tximeta: reference sequence checksums for provenance identification in RNA-seq}},
  journal = {bioRxiv},
  year = 2019
}

@article{Srivastava2019,
  title={{Alevin efficiently estimates accurate gene abundances from dscRNA-seq data}},
  author={Srivastava, Avi and Malik, Laraib and Smith, Tom Sean and Sudbery, Ian and Patro, Rob},
  journal={Genome Biology},
  year={2019},
  volume={20},
  number={65},
  url={https://doi.org/10.1186/s13059-019-1670-y}
}

tximport/vignettes/tximport.Rmd
---
title: "Importing transcript abundance with tximport"
author: "Michael I. Love, Charlotte Soneson, Mark D.
Robinson"
date: "`r Sys.Date()`"
package: "`r packageVersion('tximport')`"
output:
  rmarkdown::html_document:
    highlight: pygments
    toc: true
    toc_float: true
    fig_width: 5
bibliography: library.bib
vignette: >
  %\VignetteIndexEntry{Importing transcript abundance datasets with tximport}
  %\VignetteEngine{knitr::rmarkdown}
---

## Introduction

Import and summarize transcript-level abundance estimates for transcript- and gene-level analysis with Bioconductor packages, such as *edgeR*, *DESeq2*, and *limma-voom*. The motivation and methods for the functions provided by the *tximport* package are described in the following article [@Soneson2015]:

> Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015):
> Differential analyses for RNA-seq: transcript-level estimates
> improve gene-level inferences. *F1000Research*
> http://dx.doi.org/10.12688/f1000research.7563.1

In particular, the *tximport* pipeline offers the following benefits: (i) this approach corrects for potential changes in gene length across samples (e.g. from differential isoform usage) [@Trapnell2013Differential], (ii) some of the upstream quantification methods (*Salmon*, *Sailfish*, *kallisto*) are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files, and (iii) it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity [@Robert2015Errors].

**Note:** another Bioconductor package, [tximeta](https://bioconductor.org/packages/tximeta) [@Love2019], extends *tximport*, offering the same functionality, plus the additional benefit of automatic addition of annotation metadata for commonly used transcriptomes (GENCODE, Ensembl, RefSeq for human and mouse). See the [tximeta](https://bioconductor.org/packages/tximeta) package vignette for more details.
Whereas `tximport` outputs a simple list of matrices, `tximeta` will output a *SummarizedExperiment* object with appropriate *GRanges* added if the transcriptome is from one of the sources above for human and mouse.

```{r, echo=FALSE}
library(knitr)
opts_chunk$set(tidy=TRUE,message=FALSE)
```

## Import transcript-level estimates

We begin by locating some prepared files that contain transcript abundance estimates for six samples, from the *tximportData* package. The *tximport* pipeline will be nearly identical for various quantification tools, usually requiring only a change to the `type` argument. We begin with quantification files generated by the *Salmon* software, and later show the use of *tximport* with any of:

* *Salmon* [@Patro2017Salmon]
* *Alevin* [@Srivastava2019]
* *Sailfish* [@Patro2014Sailfish]
* *kallisto* [@Bray2016Near]
* *RSEM* [@Li2011RSEM]
* *StringTie* [@Pertea2015]

First, we locate the directory containing the files. (Here we use `system.file` to locate the package directory, but for a typical use, we would just provide a path, e.g. `"/path/to/dir"`.)

```{r}
library(tximportData)
dir <- system.file("extdata", package="tximportData")
list.files(dir)
```

Next, we create a named vector pointing to the quantification files. We will create a vector of filenames first by reading in a table that contains the sample IDs, and then combining this with `dir` and `"quant.sf.gz"`. (We gzipped the quantification files to make the data package smaller; this is not a problem for the R functions that we use to import the files.)

```{r}
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples
files <- file.path(dir, "salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
all(file.exists(files))
```

Transcripts need to be associated with gene IDs for gene-level summarization. If that information is present in the files, we can skip this step. For Salmon, Sailfish, and kallisto the files only provide the transcript ID.
We first make a data.frame called `tx2gene` with two columns: 1) transcript ID and 2) gene ID. The column names do not matter but this column order must be used. The transcript ID must be the same one used in the abundance files.

Creating this `tx2gene` data.frame can be accomplished from a *TxDb* object and the `select` function from the *AnnotationDbi* package. The following code could be used to construct such a table:

```{r}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype="TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
```

Note: if you are using an *Ensembl* transcriptome, the easiest way to create the `tx2gene` data.frame is to use the [ensembldb](http://bioconductor.org/packages/ensembldb) package. The annotation packages can be found by version number, and use the pattern `EnsDb.Hsapiens.vXX`. The `transcripts` function can be used with `return.type="DataFrame"`, in order to obtain something like the `tx2gene` object constructed in the code chunk above. See the *ensembldb* package vignette for more details.

In this case, we've used the Gencode v27 CHR transcripts to build our index, and we used `makeTxDbFromGFF` and code similar to the chunk above to build the `tx2gene` table. We then read in a pre-constructed `tx2gene` table:

```{r}
library(readr)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
head(tx2gene)
```

The *tximport* package has a single function for importing transcript-level estimates. The `type` argument is used to specify what software was used for estimation (e.g. "salmon", "sailfish", "alevin", "kallisto", "rsem", or "stringtie"). A simple list with matrices, "abundance", "counts", and "length", is returned, where the transcript-level information is summarized to the gene level. The "length" matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices, as shown below.
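
To make the gene-level summarization described above concrete, here is a minimal base R sketch on made-up numbers. It is not the *tximport* implementation: all `toy.*` objects are hypothetical, and the package additionally handles cases this sketch ignores (for example genes with zero abundance in a sample). Counts and abundances are summed over the transcripts of each gene, and the gene-level length is the average transcript length weighted by the sample-specific abundances.

```{r}
# Hypothetical transcript-to-gene map (same column order as tx2gene)
toy.tx2gene <- data.frame(TXNAME=paste0("tx", 1:4),
                          GENEID=c("geneA", "geneA", "geneB", "geneB"))

# Made-up transcript-level estimates: 4 transcripts x 2 samples
dn <- list(paste0("tx", 1:4), c("s1", "s2"))
toy.abundance <- matrix(c(10, 30, 5, 5,
                          30, 10, 5, 5), ncol=2, dimnames=dn)
toy.counts <- matrix(c(100, 600, 50, 50,
                       300, 200, 50, 50), ncol=2, dimnames=dn)
toy.length <- matrix(c(1000, 2000, 1500, 1500,
                       1000, 2000, 1500, 1500), ncol=2, dimnames=dn)

# Look up the gene of each transcript row
toy.gene <- toy.tx2gene$GENEID[match(rownames(toy.counts), toy.tx2gene$TXNAME)]

# Sum counts and abundances within each gene
toy.gene.counts <- rowsum(toy.counts, toy.gene)
toy.gene.abundance <- rowsum(toy.abundance, toy.gene)

# Abundance-weighted average transcript length per gene and sample
toy.gene.length <- rowsum(toy.abundance * toy.length, toy.gene) / toy.gene.abundance
toy.gene.length
```

In this toy example the two isoforms of `geneA` swap dominance between samples, so its average length differs between `s1` and `s2` even though the transcript lengths themselves are identical; this is the kind of change the offset is meant to correct for.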
**Note**: While *tximport* works without any dependencies, it is significantly faster to read in files using the *readr* package. If *tximport* detects that *readr* is installed, then it will use the `readr::read_tsv` function by default. A change from version 1.2 to 1.4 is that the reader is not specified by the user anymore, but chosen automatically based on the availability of the *readr* package. Advanced users can still customize the import of files using the `importer` argument.

```{r}
library(tximport)
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
names(txi)
head(txi$counts)
```

We could alternatively generate counts from abundances, using the argument `countsFromAbundance`, scaled to library size, `"scaledTPM"`, or additionally scaled using the average transcript length, averaged over samples and to library size, `"lengthScaledTPM"`. Using either of these approaches, the counts are not correlated with length, and so the length matrix should not be provided as an offset for downstream analysis packages. As of *tximport* version 1.10, we have added a new `countsFromAbundance` option `"dtuScaledTPM"`. This scaling option is designed for use with `txOut=TRUE` for differential transcript usage analyses. See `?tximport` for details on the various `countsFromAbundance` options.

We can avoid gene-level summarization by setting `txOut=TRUE`, giving the original transcript level estimates as a list of matrices.

```{r}
txi.tx <- tximport(files, type="salmon", txOut=TRUE)
```

These matrices can then be summarized afterwards using the function `summarizeToGene`. This then gives the identical list of matrices as using `txOut=FALSE` (default) in the first `tximport` call.

```{r}
txi.sum <- summarizeToGene(txi.tx, tx2gene)
all.equal(txi$counts, txi.sum$counts)
```

## Salmon

Salmon or Sailfish `quant.sf` files can be imported by setting type to `"salmon"` or `"sailfish"`.

```{r}
files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi.salmon <- tximport(files, type="salmon", tx2gene=tx2gene)
head(txi.salmon$counts)
```

We quantified with Sailfish against a different transcriptome, so we need to read in a different `tx2gene` for this next code chunk.

```{r}
tx2knownGene <- read_csv(file.path(dir, "tx2gene.csv"))
files <- file.path(dir,"sailfish", samples$run, "quant.sf")
names(files) <- paste0("sample",1:6)
txi.sailfish <- tximport(files, type="sailfish", tx2gene=tx2knownGene)
head(txi.sailfish$counts)
```

*Note*: for previous versions of Salmon or Sailfish, in which the `quant.sf` files start with comment lines, it is recommended to specify the `importer` argument as a function which reads in the lines beginning with the header. For example, using the following code chunk (un-evaluated):

```{r eval=FALSE}
txi <- tximport("quant.sf", type="none", txOut=TRUE,
                txIdCol="Name", abundanceCol="TPM",
                countsCol="NumReads", lengthCol="Length",
                importer=function(x) read_tsv(x, skip=8))
```

## Salmon with inferential replicates

If inferential replicates (Gibbs or bootstrap samples) are present in expected locations relative to the `quant.sf` file, *tximport* will import these as well, if `txOut=TRUE`. *tximport* will not summarize inferential replicate information to the gene-level. Here we demonstrate using Salmon, run with only 5 Gibbs replicates (usually more Gibbs samples would be useful for estimating variability).

```{r}
files <- file.path(dir,"salmon_gibbs", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi.inf.rep <- tximport(files, type="salmon", txOut=TRUE)
names(txi.inf.rep)
names(txi.inf.rep$infReps)
dim(txi.inf.rep$infReps$sample1)
```

The *tximport* arguments `varReduce` and `dropInfReps` can be used to summarize the inferential replicates into a single variance per transcript and per sample, or to not import inferential replicates, respectively.
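
As an illustration of what reducing inferential replicates to "a single variance per transcript and per sample" means, the following base R sketch uses simulated numbers. It assumes the replicates are stored as in `infReps` above (a list with one transcripts-by-replicates matrix per sample); the `toy.*` objects are hypothetical and this is not the code *tximport* runs internally.

```{r}
set.seed(1)
toy.tx <- paste0("tx", 1:3)

# Simulated inferential replicates: one matrix per sample,
# rows are transcripts, columns are 5 replicates
toy.infReps <- list(
  sample1 = matrix(rpois(15, lambda=50), nrow=3, dimnames=list(toy.tx, NULL)),
  sample2 = matrix(rpois(15, lambda=80), nrow=3, dimnames=list(toy.tx, NULL))
)

# Variance across replicates, computed per transcript (row) within each sample,
# giving a transcripts-by-samples matrix
toy.variance <- sapply(toy.infReps, function(m) apply(m, 1, var))
dim(toy.variance)
```

With real data the same reduction would be requested through the `varReduce` argument rather than computed by hand; see `?tximport` for how the result is returned.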

## kallisto

kallisto `abundance.h5` files can be imported by setting type to `"kallisto"`. Note that this requires that you have the Bioconductor package [rhdf5](http://bioconductor.org/packages/rhdf5) installed. (Here we only demonstrate reading in transcript-level information.)

```{r}
files <- file.path(dir, "kallisto_boot", samples$run, "abundance.h5")
names(files) <- paste0("sample",1:6)
txi.kallisto <- tximport(files, type="kallisto", txOut=TRUE)
head(txi.kallisto$counts)
```

## kallisto with inferential replicates

Because the `kallisto_boot` directory also has inferential replicate information, it was imported as well (and because `txOut=TRUE`). As with Salmon, inferential replicate information will not be summarized to the gene level.

```{r}
names(txi.kallisto)
names(txi.kallisto$infReps)
dim(txi.kallisto$infReps$sample1)
```

## kallisto with TSV files

kallisto `abundance.tsv` files can be imported as well, but this is typically slower than the approach above. Note that we add an additional argument in this code chunk, `ignoreAfterBar=TRUE`. This is because the Gencode transcripts have names like "ENST00000456328.2|ENSG00000223972.5|...", though our `tx2gene` table only includes the first "ENST" identifier. We therefore want to split the incoming quantification matrix rownames at the first bar "|", and only use this as an identifier. We didn't use this option earlier with Salmon, because we used the argument `--gencode` when running Salmon, which itself does the splitting upstream of `tximport`. Note that `ignoreTxVersion` and `ignoreAfterBar` only serve to facilitate the summarization to gene level.

```{r}
files <- file.path(dir, "kallisto", samples$run, "abundance.tsv.gz")
names(files) <- paste0("sample",1:6)
txi.kallisto.tsv <- tximport(files, type="kallisto", tx2gene=tx2gene, ignoreAfterBar=TRUE)
head(txi.kallisto.tsv$counts)
```

## RSEM

RSEM `sample.genes.results` files can be imported by setting type to `"rsem"`, and `txIn` and `txOut` to `FALSE`.

```{r}
files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".genes.results.gz"))
names(files) <- paste0("sample",1:6)
txi.rsem <- tximport(files, type="rsem", txIn=FALSE, txOut=FALSE)
head(txi.rsem$counts)
```

RSEM `sample.isoforms.results` files can be imported by setting type to `"rsem"`, and `txIn` and `txOut` to `TRUE`.

```{r}
files <- file.path(dir,"rsem", samples$run, paste0(samples$run, ".isoforms.results.gz"))
names(files) <- paste0("sample",1:6)
txi.rsem <- tximport(files, type="rsem", txIn=TRUE, txOut=TRUE)
head(txi.rsem$counts)
```

## StringTie

StringTie `t_data.ctab` files giving the coverage and abundances for transcripts can be imported by setting type to `"stringtie"`. These files can be generated with the following command line call:

```
stringtie -eB -G transcripts.gff
```

*tximport* will compute counts from the coverage information, by reversing the formula that StringTie uses to calculate coverage (see `?tximport`). The read length is used in this formula, and so if you've set a different read length when using StringTie, you can provide this information with the `readLength` argument. The `tx2gene` table should connect transcripts to genes, and can be pulled out of one of the `t_data.ctab` files. The tximport call would look like the following (here not evaluated):

```{r, eval=FALSE}
tmp <- read_tsv(files[1])
tx2gene <- tmp[,c("t_name","gene_name")]
txi <- tximport(files, type="stringtie", tx2gene=tx2gene)
```

## Alevin

scRNA-seq data quantified with *Alevin* can be easily imported using *tximport*. The following unevaluated example shows import of the quants matrix (for a live example, see the unit test file `test_alevin.R`). A single file should be specified which will import a gene-by-cell matrix of data.

```{r, eval=FALSE}
files <- "path/to/alevin/quants_mat.gz"
txi <- tximport(files, type="alevin")
```

## Downstream DGE in Bioconductor

**Note**: there are two suggested ways of importing estimates for use with differential gene expression (DGE) methods. The first method, which we show below for *edgeR* and for *DESeq2*, is to use the gene-level estimated counts from the quantification tools, and additionally to use the transcript-level abundance estimates to calculate a gene-level offset that corrects for changes to the average transcript length across samples. The code examples below accomplish these steps for you, keeping track of appropriate matrices and calculating these offsets. For *edgeR* you need to assign a matrix to `y$offset`, but the function *DESeqDataSetFromTximport* takes care of creation of the offset for you. Let's call this method "*original counts and offset*".

The second method is to use the `tximport` argument `countsFromAbundance="lengthScaledTPM"` or `"scaledTPM"`, and then to use the gene-level count matrix `txi$counts` directly as you would a regular count matrix with these packages. Let's call this method "*bias corrected counts without an offset*".

**Note:** Do not manually pass the original gene-level counts to downstream methods *without an offset*. The only case where this would make sense is if there is no length bias to the counts, as happens in 3' tagged RNA-seq data (see section below). The original gene-level counts are in `txi$counts` when `tximport` was run with `countsFromAbundance="no"`. This is simply passing the summed estimated transcript counts, and does not correct for potential differential isoform usage (the offset), which is the point of the *tximport* methods [@Soneson2015] for gene-level analysis. Passing uncorrected gene-level counts without an offset is not recommended by the *tximport* package authors.
The two methods we provide here are: "*original counts and offset*" or "*bias corrected counts without an offset*". Passing `txi` to `DESeqDataSetFromTximport` as outlined below is correct: the function creates the appropriate offset for you to perform gene-level differential expression.

## 3' tagged RNA-seq

If you have 3' tagged RNA-seq data, then correcting the counts for gene length will induce a bias in your analysis, because the counts do not have length bias. Instead of using the default full-transcript-length pipeline, we recommend using the original counts, `txi$counts`, as a counts matrix, e.g. providing them to *DESeqDataSetFromMatrix* or to the *edgeR* or *limma* functions, without calculating an offset and without using *countsFromAbundance*.

## edgeR

An example of creating a `DGEList` for use with *edgeR* [@Robinson2010]:

```{r, results="hide", message=FALSE}
library(edgeR)
library(csaw)
```

```{r}
cts <- txi$counts
normMat <- txi$length

# Obtaining per-observation scaling factors for length,
# adjusted to avoid changing the magnitude of the counts.
normMat <- normMat / exp(rowMeans(log(normMat)))
normCts <- cts / normMat

# Computing effective library sizes from scaled counts,
# to account for composition biases between samples.
library(edgeR)
eff.lib <- calcNormFactors(normCts) * colSums(normCts)

# Combining effective library sizes with the length factors,
# and calculating offsets for a log-link GLM.
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)

# Creating a DGEList object for use in edgeR.
y <- DGEList(cts)
y <- scaleOffset(y, normMat)
# filtering
keep <- filterByExpr(y)
y <- y[keep,]
# y is now ready for estimate dispersion functions
# see edgeR User's Guide
```

For creating a matrix of CPMs within *edgeR*, the following code chunk can be used:

```{r}
se <- SummarizedExperiment(assays=list(counts=y$counts, offset=y$offset))
se$totals <- y$samples$lib.size
library(csaw)
cpms <- calculateCPM(se, use.offsets=TRUE, log=FALSE)
```

## DESeq2

An example of creating a `DESeqDataSet` for use with *DESeq2* [@Love2014]:

```{r, results="hide", message=FALSE}
library(DESeq2)
```

The user should make sure the rownames of `sampleTable` align with the colnames of `txi$counts`, if there are colnames. The best practice is to read `sampleTable` from a CSV file, and to construct `files` from a column of `sampleTable`, as was shown in the *tximport* examples above.

```{r}
sampleTable <- data.frame(condition=factor(rep(c("A","B"),each=3)))
rownames(sampleTable) <- colnames(txi$counts)
```

```{r}
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~ condition)
# dds is now ready for DESeq()
# see DESeq2 vignette
```

## limma-voom

An example of creating a data object for use with *limma-voom* [@Law2014].
Because limma-voom does not use the offset matrix stored in `y$offset`, we recommend using the scaled counts generated from abundances, either `"scaledTPM"` or `"lengthScaledTPM"`:

```{r}
files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample",1:6)
txi <- tximport(files, type="salmon", tx2gene=tx2gene, countsFromAbundance="lengthScaledTPM")
library(limma)
y <- DGEList(txi$counts)
# filtering
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
design <- model.matrix(~ condition, data=sampleTable)
v <- voom(y, design)
# v is now ready for lmFit()
# see limma User's Guide
```

## Acknowledgments

The development of *tximport* has benefited from contributions and suggestions from:

[Rob Patro](https://twitter.com/nomad421) (inferential replicates import),
[Andrew Parker Morgan](https://github.com/andrewparkermorgan) (RHDF5 support),
[Ryan C. Thompson](https://github.com/DarwinAwardWinner) (RHDF5 support),
[Matt Shirley](https://twitter.com/mdshw5) (ignoreTxVersion),
[Avi Srivastava](https://twitter.com/k3yavi) (`alevin` import),
[Stephen Turner](https://twitter.com/genetics_blog),
[Richard Smith-Unna](https://twitter.com/blahah404),
[Rory Kirchner](https://twitter.com/RoryKirchner),
[Martin Morgan](https://twitter.com/mt_morgan),
Jenny Drnevich,
[Patrick Kimes](https://twitter.com/pkkimes),
[Leon Fodoulian](https://twitter.com/LFodoulian),
[Koen Van den Berge](https://twitter.com/koenvdberge_Be),
[Aaron Lun](https://github.com/LTLA)

## Session info

```{r}
sessionInfo()
```

## References