uniqtag/0000755000176200001440000000000014250560072011723 5ustar liggesusers
uniqtag/NAMESPACE0000644000176200001440000000033514236555041013147 0ustar liggesusers
# Generated by roxygen2: do not edit by hand

export(cumcount)
export(kmers_of)
export(make_unique)
export(make_unique_all)
export(make_unique_all_or_none)
export(make_unique_duplicates)
export(uniqtag)
export(vkmers_of)
uniqtag/LICENSE0000644000176200001440000000005314236554426012740 0ustar liggesusers
COPYRIGHT HOLDER: Shaun Jackman
YEAR: 2015
uniqtag/README.md0000644000176200001440000000451414236522620013207 0ustar liggesusers
UniqTag
=======

Abbreviate strings to short unique identifiers

For each string in a set of strings, determine a unique tag that is a substring of fixed size *k* unique to that string, if it has one. If no such unique substring exists, the least frequent substring is used. If multiple unique substrings exist, the lexicographically smallest substring is used. This lexicographically smallest substring of size *k* is called the uniqtag of that string.

Installation
================================================================================

Command line program
------------------------------------------------------------

```sh
curl -o ~/bin/uniqtag https://raw.githubusercontent.com/sjackman/uniqtag/master/uniqtag
chmod +x ~/bin/uniqtag
```

or using [Homebrew](https://brew.sh/) on macOS or Linux

```sh
brew install uniqtag
```

R package
------------------------------------------------------------

Install from CRAN

```r
install.packages("uniqtag")
```

or from GitHub

```
install.packages("devtools")
devtools::install_github("sjackman/uniqtag")
```

Publication
================================================================================

- Shaun D.
Jackman, Joerg Bohlmann, İnanç Birol (2014) UniqTag: Content-derived unique and stable identifiers for gene annotation. *bioRxiv*, [doi:10.1101/007583](https://doi.org/10.1101/007583).
- https://github.com/sjackman/uniqtag-paper

Summary
=======

When working on an ongoing genome sequencing and assembly project, it is rather inconvenient when gene identifiers change from one build of the assembly to the next. The gene labelling system described here, UniqTag, addresses this common challenge. UniqTag assigns a unique identifier to each gene that is a representative *k*-mer, a string of length *k*, selected from the sequence of that gene. Unlike serial numbers, these identifiers are stable between different assemblies and annotations of the same data without requiring that previous annotations be lifted over by sequence alignment. We assign UniqTag identifiers to nine builds of the Ensembl human genome spanning seven years to demonstrate this stability.
uniqtag/man/0000755000176200001440000000000014236301274012477 5ustar liggesusers
uniqtag/man/make_unique.Rd0000644000176200001440000000303214236322572015273 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/uniqtag.R
\name{make_unique}
\alias{make_unique}
\alias{make_unique_duplicates}
\alias{make_unique_all}
\alias{make_unique_all_or_none}
\title{Make character strings unique.}
\usage{
make_unique(xs, sep = "-")

make_unique_duplicates(xs, sep = "-")

make_unique_all(xs, sep = "-")

make_unique_all_or_none(xs, sep = "-")
}
\arguments{
\item{xs}{a character vector}

\item{sep}{a character string used to separate a duplicate string from its sequence number}
}
\description{
Append sequence numbers to duplicate elements to make all elements of a character vector unique.
}
\section{Functions}{
\itemize{
\item \code{make_unique}: Append a sequence number to duplicated elements, including the first occurrence.

\item \code{make_unique_duplicates}: Append a sequence number to duplicated elements, except the first occurrence.

This function behaves similarly to make.unique.

\item \code{make_unique_all}: Append a sequence number to every element.

\item \code{make_unique_all_or_none}: Append a sequence number to every element or no elements.

Return \code{xs} unchanged if the elements of the character vector \code{xs} are already unique.
Otherwise append a sequence number to every element.
}}

\examples{
abcb <- c("a", "b", "c", "b")
make_unique(abcb)
make_unique_duplicates(abcb)
make_unique_all(abcb)
make_unique_all_or_none(abcb)
make_unique_all_or_none(c("a", "b", "c"))
x <- make_unique(abbreviate(state.name, 3, strict = TRUE))
x[grep("-", x)]
}
\seealso{
make.unique
}
uniqtag/man/uniqtag.Rd0000644000176200001440000000336114236301274014441 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/uniqtag.R
\name{uniqtag}
\alias{uniqtag}
\title{Abbreviate strings to short, unique identifiers.}
\usage{
uniqtag(xs, k = 9, uniq = make_unique_all_or_none, sep = "-")
}
\arguments{
\item{xs}{a character vector}

\item{k}{the size of the identifier, an integer}

\item{uniq}{a function to make the abbreviations unique, such as make_unique, make_unique_duplicates, make_unique_all_or_none, make_unique_all, make.unique, or to disable this function, identity or NULL}

\item{sep}{a character string used to separate a duplicate string from its sequence number}
}
\value{
a character vector of the UniqTags of the strings \code{xs}
}
\description{
Abbreviate strings to unique substrings of \code{k} characters.
}
\details{
For each string in a set of strings, determine a unique tag that is a substring of fixed size \code{k} unique to that string, if it has one. If no such unique substring exists, the least frequent substring is used. If multiple unique substrings exist, the lexicographically smallest substring is used.
This lexicographically smallest substring of size \code{k} is called the UniqTag of that string.

The lexicographically smallest substring depends on the locale's sort order.
You may wish to first call \code{Sys.setlocale("LC_COLLATE", "C")}.
}
\examples{
Sys.setlocale("LC_COLLATE", "C")
states <- sub(" ", "", state.name)
uniqtags <- uniqtag(states)
uniqtags4 <- uniqtag(states, k = 4)
uniqtags3 <- uniqtag(states, k = 3)
uniqtags3x <- uniqtag(states, k = 3, uniq = make_unique)
table(nchar(states))
table(nchar(uniqtags))
table(nchar(uniqtags4))
table(nchar(uniqtags3))
table(nchar(uniqtags3x))
uniqtags3[grep("-", uniqtags3x)]
}
\seealso{
abbreviate, locales, make.unique
}
uniqtag/man/uniqtag-package.Rd0000644000176200001440000000123114236311137016023 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/uniqtag.R
\docType{package}
\name{uniqtag-package}
\alias{uniqtag-package}
\title{Abbreviate strings to short, unique identifiers.}
\description{
For each string in a set of strings, determine a unique tag that is a substring of fixed size k
unique to that string, if it has one. If no such unique substring exists, the least frequent
substring is used. If multiple unique substrings exist, the lexicographically smallest substring
is used. This lexicographically smallest substring of size k is called the "UniqTag" of that
string.
}
\author{
Shaun Jackman \email{sjackman@gmail.com}
}
uniqtag/man/cumcount.Rd0000644000176200001440000000073314236301274014626 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/uniqtag.R
\name{cumcount}
\alias{cumcount}
\title{Cumulative count of strings.}
\usage{
cumcount(xs)
}
\arguments{
\item{xs}{a character vector}
}
\value{
an integer vector of the cumulative string counts
}
\description{
Return an integer vector counting the number of occurrences of each string up to that position in the vector.
}
\examples{
cumcount(abbreviate(state.name, 3, strict = TRUE))
}
uniqtag/man/kmers_of.Rd0000644000176200001440000000143414236301274014575 0ustar liggesusers
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/uniqtag.R
\name{kmers_of}
\alias{kmers_of}
\alias{vkmers_of}
\title{Return the k-mers of a string.}
\usage{
kmers_of(x, k)

vkmers_of(xs, k)
}
\arguments{
\item{x}{a character string}

\item{k}{the size of the substrings, an integer}

\item{xs}{a character vector}
}
\value{
kmers_of: a character vector of the k-mers of \code{x}

vkmers_of: a list of character vectors of the k-mers of \code{xs}
}
\description{
Return the k-mers (substrings of size \code{k}) of the string \code{x}, or
return the string \code{x} itself if it is shorter than k.
}
\section{Functions}{
\itemize{
\item \code{kmers_of}: Return the k-mers of the string \code{x}.

\item \code{vkmers_of}: Return the k-mers of the strings \code{xs}.
}}

uniqtag/DESCRIPTION0000644000176200001440000000174014250560072013433 0ustar liggesusers
Type: Package
Package: uniqtag
Title: Abbreviate Strings to Short, Unique Identifiers
Version: 1.0.1
Authors@R: person("Shaun", "Jackman", email = "sjackman@gmail.com", role = c("aut", "cph", "cre"))
Description: For each string in a set of strings, determine a unique tag
    that is a substring of fixed size k unique to that string, if it has
    one. If no such unique substring exists, the least frequent substring
    is used. If multiple unique substrings exist, the lexicographically
    smallest substring is used. This lexicographically smallest substring
    of size k is called the "UniqTag" of that string.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.1.2
URL: https://github.com/sjackman/uniqtag
BugReports: https://github.com/sjackman/uniqtag/issues
Suggests: testthat
NeedsCompilation: no
Packaged: 2022-05-10 21:34:38 UTC; shaun.jackman
Author: Shaun Jackman [aut, cph, cre]
Maintainer: Shaun Jackman
Repository: CRAN
Date/Publication: 2022-06-10 06:10:02 UTC
uniqtag/tests/0000755000176200001440000000000014236301107013061 5ustar liggesusers
uniqtag/tests/testthat/0000755000176200001440000000000014250560072014725 5ustar liggesusers
uniqtag/tests/testthat/test-kmers-of.R0000644000176200001440000000034714236301107017550 0ustar liggesusers
test_that("kmers_of", expect_equal(
  kmers_of("hello", 3),
  c("hel", "ell", "llo")))

test_that("vkmers_of", expect_equal(
  vkmers_of(c("hello", "world"), 3),
  list(hello = c("hel", "ell", "llo"), world = c("wor", "orl", "rld"))))
uniqtag/tests/testthat/test-make-unique.R0000644000176200001440000000110114236301107020233 0ustar liggesusers
abc <- c("a", "b", "c")
abcb <- c("a", "b", "c", "b")

test_that("make_unique", expect_equal(
  make_unique(abcb),
  c("a", "b-1", "c", "b-2")))

test_that("make_unique_duplicates", expect_equal(
  make_unique_duplicates(abcb),
  c("a", "b", "c", "b-1")))

test_that("make_unique_all", expect_equal(
  make_unique_all(abcb),
  c("a-1", "b-1", "c-1", "b-2")))

test_that("make_unique_all_or_none 1", expect_equal(
  make_unique_all_or_none(abcb),
  c("a-1", "b-1", "c-1", "b-2")))

test_that("make_unique_all_or_none 2", expect_equal(
  make_unique_all_or_none(abc),
  c("a", "b", "c")))
uniqtag/tests/testthat/test-uniqtag.R0000644000176200001440000000255714236301107017502 0ustar liggesusers
Sys.setlocale("LC_COLLATE", "C")

test_that("uniqtag aaaaaab k=3", expect_equal(
  unname(uniqtag(c("aaaaaab", "aaab"), k = 3)),
  c("aaa-1", "aaa-2")))

test_that("uniqtag aaaaaab k=4", expect_equal(
  unname(uniqtag(c("aaaaaab", "aaab"), k = 4)),
  c("aaaa", "aaab")))

states <- sub(" ", "", state.name)

states3 <- setNames(c(
  "aba-1", "las-1", "Ari-1", "Ark-1", "Cal-1", "Col-1", "Con-1", "Del-1", "Flo-1", "Geo-1",
  "Haw-1", "Ida-1", "Ill-1", "Ind-1", "Iow-1", "Kan-1", "Ken-1", "Lou-1", "Mai-1", "Mar-1",
  "Mas-1", "Mic-1", "Min-1", "ipp-1", "our-1", "Mon-1", "Neb-1", "Nev-1", "Ham-1", "Jer-1",
  "Mex-1", "Yor-1", "Car-1", "Dak-1", "Ohi-1", "Okl-1", "Ore-1", "Pen-1", "Isl-1", "Car-2",
  "Dak-2", "Ten-1", "Tex-1", "Uta-1", "Ver-1", "Vir-1", "Was-1", "Wes-1", "Wis-1", "Wyo-1"),
  states)
test_that("uniqtag states k=3", expect_equal(uniqtag(states, k = 3), states3))

states4 <- setNames(c(
  "Alab", "Alas", "Ariz", "Arka", "Cali", "Colo", "Conn", "Dela", "Flor", "Geor",
  "Hawa", "Idah", "Illi", "Indi", "Iowa", "Kans", "Kent", "Loui", "Main", "Mary",
  "Mass", "Mich", "Minn", "ippi", "isso", "Mont", "Nebr", "Neva", "Hamp", "Jers",
  "Mexi", "NewY", "rthC", "rthD", "Ohio", "Okla", "Oreg", "Penn", "Isla", "uthC",
  "uthD", "Tenn", "Texa", "Utah", "Verm", "Virg", "Wash", "West", "Wisc", "Wyom"),
  states)
test_that("uniqtag states k=4", expect_equal(uniqtag(states, k = 4), states4))
uniqtag/tests/testthat.R0000644000176200001440000000007214236301107015043 0ustar liggesusers
library(testthat)
library(uniqtag)
test_check("uniqtag")
uniqtag/R/0000755000176200001440000000000014236322443012126 5ustar liggesusers
uniqtag/R/uniqtag.R0000644000176200001440000001322614236322443013725 0ustar liggesusers
#' Abbreviate strings to short, unique identifiers.
#'
#' For each string in a set of strings, determine a unique tag that is a substring of fixed size k
#' unique to that string, if it has one. If no such unique substring exists, the least frequent
#' substring is used. If multiple unique substrings exist, the lexicographically smallest substring
#' is used. This lexicographically smallest substring of size k is called the "UniqTag" of that
#' string.
#' @docType package
#' @name uniqtag-package
#' @author Shaun Jackman \email{sjackman@@gmail.com}
NULL

#' Return the k-mers of a string.
#'
#' Return the k-mers (substrings of size \code{k}) of the string \code{x}, or
#' return the string \code{x} itself if it is shorter than k.
#' @describeIn kmers_of Return the k-mers of the string \code{x}.
#' @param k the size of the substrings, an integer
#' @param x a character string
#' @return kmers_of: a character vector of the k-mers of \code{x}
#' @export
kmers_of <- function(x, k) {
  if (nchar(x) < k) {
    x
  } else {
    substring(x, 1:(nchar(x) - k + 1), k:nchar(x))
  }
}

#' @describeIn kmers_of Return the k-mers of the strings \code{xs}.
#' @param xs a character vector
#' @return vkmers_of: a list of character vectors of the k-mers of \code{xs}
#' @export
vkmers_of <- function(xs, k) {
  Vectorize(kmers_of, SIMPLIFY = FALSE)(xs, k)
}

#' Cumulative count of strings.
#'
#' Return an integer vector counting the number of occurrences of each string up to that position in the vector.
#' @param xs a character vector
#' @return an integer vector of the cumulative string counts
#' @examples
#' cumcount(abbreviate(state.name, 3, strict = TRUE))
#' @export
cumcount <- function(xs) {
  counts <- new.env(parent = emptyenv())
  stats::setNames(vapply(
    xs,
    function(x) {
      counts[[x]] <- 1L + mget(x, counts, ifnotfound = 0L)[[1]]
    },
    integer(1)
  ), xs)
}

#' Make character strings unique.
#'
#' Append sequence numbers to duplicate elements to make all elements of a character vector unique.
#' @param xs a character vector
#' @param sep a character string used to separate a duplicate string from its sequence number
#' @describeIn make_unique Append a sequence number to duplicated elements, including the first occurrence.
#' @seealso make.unique
#' @examples
#' abcb <- c("a", "b", "c", "b")
#' make_unique(abcb)
#' make_unique_duplicates(abcb)
#' make_unique_all(abcb)
#' make_unique_all_or_none(abcb)
#' make_unique_all_or_none(c("a", "b", "c"))
#' x <- make_unique(abbreviate(state.name, 3, strict = TRUE))
#' x[grep("-", x)]
#' @export
make_unique <- function(xs, sep = "-") {
  i <- xs %in% xs[duplicated(xs)]
  xs[i] <- make_unique_all(xs[i], sep)
  xs
}

#' @describeIn make_unique Append a sequence number to duplicated elements, except the first occurrence.
#'
#' This function behaves similarly to make.unique.
#' @export
make_unique_duplicates <- function(xs, sep = "-") {
  i <- duplicated(xs)
  xs[i] <- make_unique_all(xs[i], sep)
  xs
}

#' @describeIn make_unique Append a sequence number to every element.
#' @export
make_unique_all <- function(xs, sep = "-") {
  xs[] <- paste(xs, cumcount(xs), sep = sep)
  xs
}

#' @describeIn make_unique Append a sequence number to every element or no elements.
#'
#' Return \code{xs} unchanged if the elements of the character vector \code{xs} are already unique.
#' Otherwise append a sequence number to every element.
#' @export
make_unique_all_or_none <- function(xs, sep = "-") {
  if (anyDuplicated(xs)) make_unique_all(xs, sep) else xs
}

#' Abbreviate strings to short, unique identifiers.
#'
#' Abbreviate strings to unique substrings of \code{k} characters.
#'
#' For each string in a set of strings, determine a unique tag that is a substring of fixed size \code{k} unique to that string, if it has one. If no such unique substring exists, the least frequent substring is used. If multiple unique substrings exist, the lexicographically smallest substring is used. This lexicographically smallest substring of size \code{k} is called the UniqTag of that string.
#'
#' The lexicographically smallest substring depends on the locale's sort order.
#' You may wish to first call \code{Sys.setlocale("LC_COLLATE", "C")}.
#'
#' @examples
#' Sys.setlocale("LC_COLLATE", "C")
#' states <- sub(" ", "", state.name)
#' uniqtags <- uniqtag(states)
#' uniqtags4 <- uniqtag(states, k = 4)
#' uniqtags3 <- uniqtag(states, k = 3)
#' uniqtags3x <- uniqtag(states, k = 3, uniq = make_unique)
#' table(nchar(states))
#' table(nchar(uniqtags))
#' table(nchar(uniqtags4))
#' table(nchar(uniqtags3))
#' table(nchar(uniqtags3x))
#' uniqtags3[grep("-", uniqtags3x)]
#' @param xs a character vector
#' @param k the size of the identifier, an integer
#' @param uniq a function to make the abbreviations unique, such as make_unique, make_unique_duplicates, make_unique_all_or_none, make_unique_all, make.unique, or to disable this function, identity or NULL
#' @param sep a character string used to separate a duplicate string from its sequence number
#' @return a character vector of the UniqTags of the strings \code{xs}
#' @seealso abbreviate, locales, make.unique
#' @export
uniqtag <- function(xs, k = 9, uniq = make_unique_all_or_none, sep = "-") {
  if (is.null(uniq)) {
    uniq <- identity
    sep <- NA
  }
  counts <- table(unlist(lapply(vkmers_of(xs, k), unique)))
  counts_kmers <- stats::setNames(
    paste0(format(counts, justify = "right"), names(counts)),
    names(counts)
  )
  tags <- vapply(
    xs,
    function(x) {
      names(counts_kmers)[match(min(counts_kmers[kmers_of(x, k)]), counts_kmers)]
    },
    character(1)
  )
  if (is.na(sep)) uniq(tags) else uniq(tags, sep)
}
uniqtag/MD50000644000176200001440000000133114250560072012231 0ustar liggesusers
1a70522556eb1730da4097d84c5982a5 *DESCRIPTION
d933c18e193e58499c8b7eb3a73beff8 *LICENSE
7d0253a676ca06b93e580748524ef900 *NAMESPACE
a6b47356daa0aeeb837a8e320afbef38 *R/uniqtag.R
660e33d22d94431684cd0195248258bb *README.md
e4e00e1a44af5390247f87ac27d1bae4 *man/cumcount.Rd
71206d3a9d75fb7b15241690a9aab687 *man/kmers_of.Rd
34cf7dce4e4cb0b87ef65af4379c9524 *man/make_unique.Rd
42e047b11bc911790707a304e927cf89 *man/uniqtag-package.Rd
c100680a1de44b1edad2ab0a27460643 *man/uniqtag.Rd
4d957ae64a6c64be45db5dd5a38f1592 *tests/testthat.R
82c152aad01f744b1a862941afff5283 *tests/testthat/test-kmers-of.R
a2bc0c4bbf15da4b795c06e8bbee30b8 *tests/testthat/test-make-unique.R
1867f837288914dd145a6c9377933b86 *tests/testthat/test-uniqtag.R