tcR/inst/library.report.Rmd
---
title: "TCR beta repertoire exploratory analysis"
output:
  html_document:
    theme: spacelab
    toc: yes
---
## Before analysis
Before running this pipeline, complete the following steps:
1. Save your parsed MiTCR data to the `immdata` variable (it could be **a mitcr data frame or a list**).
```
immdata <- parse.folder('/home/username/mitcrdata/')
```
2. Save the `immdata` variable to a folder as an `.rda` file.
```
save(immdata, file = '/home/username/immdata.rda')
```
3. In the code block below, change the path string to the path to your `immdata.rda` file. Then click the **Knit HTML** button to start the analysis; the script will generate an HTML report.
```{r loaddata,warning=FALSE,message=F}
load('../data/twb.rda')
immdata <- twb[1:2]
library(tcR)
```
4. Friendly advice: run the pipeline on the top N sequences first and adjust the figure sizes before processing the full data.
```{r lapplyhead,warning=FALSE,message=F}
N <- 50000
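# Take the top N rows of each data frame: head() on a list would only select list elements.
immdata <- if (is.data.frame(immdata)) head(immdata, N) else lapply(immdata, head, N)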
```
## Data statistics
```{r seqstats,warning=FALSE,message=F}
cloneset.stats(immdata)
repseq.stats(immdata)
```
## Segments' statistics
### V-segment usage
```{r vusagehist, fig.width=16, fig.height=5,warning=FALSE,message=F}
if (has.class(immdata, 'list')) {
for (i in 1:length(immdata)) {
plot(vis.gene.usage(immdata[[i]], HUMAN_TRBV, .main = paste0(names(immdata)[i], ' ', 'V-usage')))
}
} else {
vis.gene.usage(immdata, HUMAN_TRBV, .coord.flip=F)
}
```
### J-segment usage
```{r jusagehist, fig.width=14, fig.height=5,warning=FALSE,message=F}
if (has.class(immdata, 'list')) {
for (i in 1:length(immdata)) {
plot(vis.gene.usage(immdata[[i]], HUMAN_TRBJ, .main = paste0(names(immdata)[i], ' ', 'J-usage')))
}
} else {
vis.gene.usage(immdata, HUMAN_TRBJ, .coord.flip=F)
}
```
## CDR3 length and read distributions
```{r cdr3len, fig.width=10, fig.height=5,warning=FALSE,message=F}
if (has.class(immdata, 'list')) {
for (i in 1:length(immdata)) {
plot(vis.count.len(immdata[[i]], .name = paste0(names(immdata)[i], ' ', 'CDR3 length')))
}
} else {
vis.count.len(immdata)
}
```
```{r readdistr, fig.width=10, fig.height=5,warning=FALSE,message=F}
if (has.class(immdata, 'list')) {
for (i in 1:length(immdata)) {
plot(vis.number.count(immdata[[i]], .name = paste0(names(immdata)[i], ' ', 'read histogram')))
}
} else {
vis.number.count(immdata)
}
```
## Proportions of the most abundant clones
```{r topprop, fig.width=13,warning=FALSE,message=F}
vis.top.proportions(immdata)
```
## Most frequent kmers
```{r kmers, fig.width=14,warning=FALSE,message=F}
kms <- get.kmers(immdata, .verbose = F)
vis.kmer.histogram(kms, .position = 'fill')
```
## Rarefaction analysis
```{r muc, fig.width=11,warning=FALSE,message=F}
clmn <- 'Read.count'
if (has.class(immdata, 'list')) {
if (!is.na(immdata[[1]]$Umi.count[1])) {
clmn <- 'Umi.count'
}
} else {
if (!is.na(immdata$Umi.count[1])) {
clmn <- 'Umi.count'
}
}
vis.rarefaction(rarefaction(immdata, .col = clmn, .verbose = F), .log = T)
```
tcR/inst/CITATION
bibentry('Article',
title = 'tcR: an R package for T-cell receptor repertoire data analysis',
author = as.person("Vadim I. Nazarov [aut], Mikhail V. Pogorelyy [aut], Ekaterina A. Komech [aut], Ivan V. Zvyagin [aut], Dmitry A. Bolotin [aut], Mikhail Shugay [aut], Dmitry M. Chudakov [aut], Yury B. Lebedev and Ilgar Z. Mamedov [aut]"),
year = 2015,
url = "http://www.biomedcentral.com/1471-2105/16/175",
journal = "BMC Bioinformatics",
pages = 175,
volume = 16,
doi = "10.1186/s12859-015-0613-1"
)
tcR/inst/crossanalysis.report.Rmd
---
title: "TCR beta repertoire intergroup analysis"
output:
  html_document:
    theme: spacelab
    toc: yes
---
## Before analysis
Before running this pipeline, complete the following steps:
1. Save your parsed MiTCR data to the `immdata` variable (**it must be a list with mitcr data frames**).
```
immdata <- parse.folder('/home/username/mitcrdata/')
```
2. Save the `immdata` variable to a folder as an `.rda` file.
```
save(immdata, file = '/home/username/immdata.rda')
```
3. In the code block below, change the path string to the path to your `immdata.rda` file. Then click the **Knit HTML** button to start the analysis and produce an output .html file with its results.
```{r loaddata,warning=FALSE,message=F}
load('../data/twb.rda')
immdata <- twb
library(tcR)
```
4. Friendly advice: run the pipeline on the top N sequences first and then adjust the figure sizes.
```{r lapplyhead,warning=FALSE,message=F}
N <- 10000
immdata <- lapply(immdata, head, N)
```
## Number of shared clones and clonotypes
### Number of shared clones (CDR3 nucleotide sequences)
#### -- without and with normalisation
```{r nuc0crosses, fig.width=13,warning=FALSE,message=F}
crs1 <- repOverlap(immdata, .norm=F, .verbose=F)
crs2 <- repOverlap(immdata, .norm=T, .verbose=F)
do.call(grid.arrange, list(vis.heatmap(crs1, .title = 'Number of shared clones', .legend = 'Shared clones'), vis.heatmap(crs2, .title = 'Number of shared clones', .legend = 'Shared clones'), nrow = 1))
```
### Number of shared clonotypes (CDR3 amino acid sequences)
#### -- without and with normalisation
```{r aa0crosses, fig.width=13,warning=FALSE,message=F}
crs1 <- repOverlap(immdata, .seq = "aa", .norm=F, .verbose=F)
crs2 <- repOverlap(immdata, .seq = "aa", .norm=T, .verbose=F)
do.call(grid.arrange, list(vis.heatmap(crs1), vis.heatmap(crs2), nrow = 1))
```
### Number of shared clones using V-segments
#### -- without and with normalisation
```{r nucvcrosses, fig.width=13,warning=FALSE,message=F}
crs1 <- repOverlap(immdata, .vgene = T, .norm=F, .verbose=F)
crs2 <- repOverlap(immdata, .vgene = T, .norm=T, .verbose=F)
do.call(grid.arrange, list(vis.heatmap(crs1, .title = 'Number of shared clones + V', .legend = 'Shared clones'), vis.heatmap(crs2, .title = 'Number of shared clones + V'), nrow = 1))
```
### Number of shared clonotypes using V-segments
#### -- without and with normalisation
```{r aavcrosses, fig.width=13,warning=FALSE,message=F}
crs1 <- repOverlap(immdata, .seq = "aa", .vgene = T, .norm=F, .verbose=F)
crs2 <- repOverlap(immdata, .seq = "aa", .vgene = T, .norm=T, .verbose=F)
do.call(grid.arrange, list(vis.heatmap(crs1, .title = 'Number of shared clonotypes + V'), vis.heatmap(crs2, .title = 'Number of shared clonotypes + V'), nrow = 1))
```
## Segments statistics
### V-segments usage
```{r vusagehist, fig.width=16, fig.height=10,warning=FALSE,message=F}
vis.gene.usage(immdata, HUMAN_TRBV, .ncol = 2, .coord.flip = F)
```
### V-segments summary statistics
```{r vseboxplot, fig.width = 13, fig.height=10,warning=FALSE,message=F}
# Change the groups variable for plotting V-usage boxplot for groups.
half <- length(immdata) %/% 2  # integer division keeps the indices whole for odd-length lists
groups <- list(Group.A = names(immdata)[seq_len(half)],
               Group.B = names(immdata)[(half + 1):length(immdata)])
vis.group.boxplot(geneUsage(immdata, HUMAN_TRBV, .norm = T), groups, .rotate.x = T)
```
### J-segments usage
```{r jusagehist, fig.width=16, fig.height=10,warning=FALSE,message=F}
vis.gene.usage(immdata, HUMAN_TRBJ, .coord.flip=F, .ncol = 2)
```
## Jensen-Shannon divergence applied to the segment frequencies of the supplied data frames
### V-segments Jensen-Shannon divergence among repertoires
```{r vdiv, fig.width=13,warning=FALSE,message=F,prompt=FALSE}
res <- js.div.seg(immdata, HUMAN_TRBV, .frame='all', .verbose = F)
vis.heatmap(round(res, 5))
```
### V-segments Jensen-Shannon divergence radar plot
```{r radars, fig.width=13,warning=FALSE,message=F,prompt=FALSE}
res <- js.div.seg(immdata, HUMAN_TRBV, .frame='all', .verbose = F)
vis.radarlike(res, .ncol = 2)
```
## PCA on segments' frequencies
### PCA on V-segments statistics
```{r pcav,warning=FALSE,message=F}
pca.segments(immdata)
```
### PCA on V-J segments statistics
```{r pcavj,warning=FALSE,message=F}
pca.segments.2D(immdata, .genes = list(HUMAN_TRBV, HUMAN_TRBJ))
```
## Top cross
```{r topcross, fig.width=16, fig.height=16,warning=FALSE,message=F}
top.cross.plot(top.cross(permutedf(immdata), seq(500, min(sapply(immdata, nrow)), 500), .verbose = F))
```
## Shared repertoire statistics by clonotypes using V-segments
```{r shared,warning=FALSE,message=F}
imm.sh <- shared.repertoire(immdata, 'av', .verbose = F)
shared.clones.count(imm.sh)
shared.representation(imm.sh)
```
## Rarefaction group analysis
```{r muc, fig.width=11,warning=FALSE,message=F}
clmn <- 'Read.count'
if (!is.na(immdata[[1]]$Umi.count[1])) {
clmn <- 'Umi.count'
}
vis.rarefaction(rarefaction(immdata, .col = clmn, .verbose = F), list(A = c("Subj.A", "Subj.B"), B = c("Subj.C", "Subj.D")), .log = T)
```
tcR/inst/doc/tcrvignette.Rmd
---
title: '
tcR: a package for T cell receptor and Immunoglobulin repertoires advanced data analysis
Vadim I. Nazarov
'
author:
Laboratory
of Comparative and Functional Genomics, IBCH RAS, Moscow, Russia
output:
  html_document:
    theme: spacelab
    toc: yes
    toc_depth: 4
  pdf_document:
    toc: yes
    toc_depth: 4
  word_document: default
---
## Introduction
The *tcR* package is designed to help researchers in the immunology field analyse T cell receptor (`TCR`) and immunoglobulin (`Ig`) repertoires.
In this vignette, I will cover procedures for immune receptor repertoire analysis provided with the package.
Terms:
- Clonotype: a group of T / B cell clones with equal CDR3 nucleotide sequences and equal Variable genes.
- Cloneset / repertoire: a set of clonotypes. Represented as a data frame in which each row corresponds to a unique clonotype.
- UMI: Unique Molecular Identifier (see this [paper](http://www.nature.com/nmeth/journal/v9/n1/full/nmeth.1778.html) for details)
```{r eval=TRUE,echo=FALSE,warning=FALSE,message=FALSE}
library(tcR)
data(twa)
data(twb)
```
### Package features
- Parsers for outputs of various tools for CDR3 extraction and genes alignment *(currently implemented parsers for MiTCR, MiGEC, VDJtools, ImmunoSEQ, IMSEQ and MiXCR)*
- Data manipulation *(in-frame / out-of-frame sequences subsetting, clonotype motif search)*
- Descriptive statistics *(number of reads, number of clonotypes, gene segment usage)*
- Shared clonotypes statistics *(number of shared clonotypes, using V genes or not; sequential intersection among the most abundant clonotypes ("top-cross"))*
- Repertoire comparison *(Jaccard index, Morisita's overlap index, Horn's index, Tversky index, overlap coefficient)*
- V- and J-gene usage and its analysis *(PCA, Shannon Entropy, Jensen-Shannon Divergence)*
- Diversity evaluation *(ecological diversity index, Gini index, inverse Simpson index, rarefaction analysis)*
- Artificial repertoire generation (beta chain only, for now)
- Spectratyping
- Various visualisation procedures
- Mutation networks *(graphs, in which vertices represent CDR3 nucleotide / amino acid sequences and edges are connecting similar sequences with low hamming or edit distance between them)*
### Data in the package
There are two datasets provided with the package - twins data and V(D)J recombination genes data.
#### Downsampled twins data
`twa.rda`, `twb.rda` - two lists with 4 data frames in each list. Every data frame is a sample downsampled to the 10000 most abundant clonotypes of twins data (alpha and beta chains). Full data is available here:
[Twins TCR data at Laboratory of Comparative and Functional Genomics](http://labcfg.ibch.ru/tcr.html)
Explore the data:
```{r eval=FALSE,echo=TRUE}
# Load the package and load the data.
library(tcR)
data(twa) # "twa" - list of length 4
data(twb) # "twb" - list of length 4
# Explore the data.
head(twa[[1]])
head(twb[[1]])
```
#### Gene alphabets
Gene alphabets - character vectors with names of genes for TCR and Ig.
```{r eval=FALSE,echo=TRUE}
# Run help to see available alphabets.
?genealphabets
?genesegments
data(genesegments)
```
### Quick start / automatic report generation
For the exploratory analysis of a single repertoire, use the RMarkdown report file at
`"/inst/library.report.Rmd"`
The analysis in this file includes statistics and visualisation of the number of clones, clonotypes, in- and out-of-frame sequences, unique amino acid CDR3 sequences, V- and J-usage, most frequent k-mers, and rarefaction analysis.
For the analysis of a group of repertoires ("cross-analysis"), use the RMarkdown report file at:
`"/inst/crossanalysis.report.Rmd"`
The analysis in this file includes statistics and visualisation of the number of shared clones and clonotypes, V-usage for individuals and groups, J-usage for individuals, Jensen-Shannon divergence among V-usages of repertoires, and top-cross.
You need the *knitr* package installed in order to generate reports from default pipelines. In RStudio you can run a pipeline file as follows:
`Run RStudio -> load the pipeline .Rmd files -> press the knitr button`
### Input parsing
Currently *tcR* implements parsers for the output of the following software:
- MiTCR - `parse.mitcr`;
- MiTCR w/ UMIs - `parse.mitcrbc`;
- MiGEC - `parse.migec`;
- VDJtools - `parse.vdjtools`;
- ImmunoSEQ - `parse.immunoseq`;
- MiXCR - `parse.mixcr`;
- IMSEQ - `parse.imseq`.
A general parser for text table files, `parse.cloneset`, is also implemented. The general wrapper for parsers is `parse.file`. You can also parse a list of files or an entire folder. Run `?parse.folder` to see help on parsing input files and a list of functions for parsing specific input formats.
```{r eval=FALSE,echo=TRUE}
# Parse file in "~/mitcr_data/immdata1.txt" as a MiTCR file.
immdata1 <- parse.file("~/mitcr_data/immdata1.txt", 'mitcr')
# equivalent to
immdata1.eq <- parse.mitcr("~/mitcr_data/immdata1.txt")
# Parse folder with MiGEC files.
immdata <- parse.folder("~/migec_data/", 'migec')
```
### Cloneset representation
Clonesets are represented in *tcR* as data frames in which each row corresponds to one nucleotide clonotype, with the following column names:
- *Umi.count* - number of UMIs;
- *Umi.proportion* - proportion of UMIs;
- *Read.count* - number of reads;
- *Read.proportion* - proportion of reads;
- *CDR3.nucleotide.sequence* - CDR3 nucleotide sequence;
- *CDR3.amino.acid.sequence* - CDR3 amino acid sequence;
- *V.gene* - names of aligned Variable genes;
- *J.gene* - names of aligned Joining genes;
- *D.gene* - names of aligned Diversity genes;
- *V.end* - last positions of aligned V genes (1-based);
- *J.start* - first positions of aligned J genes (1-based);
- *D5.end* - positions of the 5' end of aligned D genes (1-based);
- *D3.end* - positions of the 3' end of aligned D genes (1-based);
- *VD.insertions* - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
- *DJ.insertions* - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
- *Total.insertions* - total number of inserted nucleotides (number of N-nucleotides at V-J junction for receptors with VJ recombination).
Any data frame with these columns and of this class is suitable for processing with the package, so you can generate your own table files and load them for further analysis using `read.csv`, `read.table` and other `base` R functions. Please note that *tcR* internally expects all strings to be of class "character", not "factor". Therefore you should call R parsing functions with the parameter *stringsAsFactors=FALSE*.
```{r eval=TRUE, echo=TRUE}
# No D genes are available here, hence "" in "D.gene" and "-1" in the position columns.
str(twa[[1]])
str(twb[[1]])
```
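As a minimal sketch of loading a hand-made table, the block below writes a small tab-separated file and reads it back with strings kept as characters. The file contents, sequences and gene assignments are hypothetical, and only a subset of the columns listed above is shown, so this illustrates the import step rather than a complete cloneset.

```r
# Hypothetical three-clonotype table written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Read.count\tCDR3.nucleotide.sequence\tCDR3.amino.acid.sequence\tV.gene\tJ.gene",
  "60\tTGTGCCAGCAGCTTCTTC\tCASSFF\tTRBV4-1\tTRBJ1-1",
  "30\tTGTGCCAGCAGCCTGTTC\tCASSLF\tTRBV5-1\tTRBJ2-7",
  "10\tTGTGCCAGCAGTTTCTTT\tCASSFF\tTRBV4-1\tTRBJ1-1"
), tmp)

# Keep strings as "character", not "factor".
cloneset <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Derive the proportion column from the raw counts.
cloneset$Read.proportion <- cloneset$Read.count / sum(cloneset$Read.count)
str(cloneset)
```

Rows 1 and 3 differ in nucleotide sequence but share the amino acid sequence, which is why each row is a nucleotide clonotype.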
## Repertoire descriptive statistics
For the exploratory analysis *tcR* provides various functions for computing descriptive statistics.
### Cloneset summary
To get a general view of a subject's repertoire (overall count of sequences, in- and out-of-frames numbers and proportions) use the `cloneset.stats` function. It returns a `summary` of counts of nucleotide and amino acid clonotypes, as well as summary of read counts:
```{r eval=TRUE,echo=TRUE}
cloneset.stats(twb)
```
For characterisation of a library use the `repseq.stats` function:
```{r eval=TRUE,echo=TRUE}
repseq.stats(twb)
```
### Most abundant clonotypes statistics
The function `clonal.proportion` returns the number of most abundant clonotypes (by read count) that together account for a given percentage of the repertoire. E.g., compute the number of clonotypes that fill up (approximately) 25% of the repertoire's total "Read.count":
```{r eval=TRUE,echo=TRUE}
# How many clonotypes fill up approximately
clonal.proportion(twb, 25) # the 25% of the sum of values in 'Read.count'?
```
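The idea behind this statistic can be sketched in base R with a cumulative sum over sorted read counts. The counts and the helper `n.for.share` are hypothetical, and the package's own implementation may handle ties and thresholds differently.

```r
# Toy read counts, sorted in decreasing order as in a cloneset.
read.count <- c(50, 20, 10, 10, 5, 3, 2)

# Cumulative share of reads covered by the top 1, 2, ... clonotypes.
cum.share <- cumsum(read.count) / sum(read.count)

# Hypothetical helper: smallest number of top clonotypes reaching the share.
n.for.share <- function(share) which(cum.share >= share)[1]

n.for.share(0.25)
n.for.share(0.75)
```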
To get a proportion of the most abundant clonotypes' sum of reads to the overall number of reads in a repertoire, use `top.proportion`, i.e. get
($\sum$ reads of top clonotypes)$/$($\sum$ reads for all clonotypes).
E.g., get a proportion of the top-10 clonotypes' reads to the overall number of reads:
```{r echo=TRUE, eval=TRUE, fig=TRUE, fig.height=4, fig.width=5.5, message=FALSE, fig.align='center'}
# What proportion of all reads is accounted for
top.proportion(twb, 10) # by the top-10 clonotypes?
vis.top.proportions(twb) # Plot these proportions.
```
The function `tailbound.proportion`, with the arguments *.col* and *.bound*, takes the subset of the given data frame containing clonotypes whose value in column *.col* is $\leq$ *.bound* and computes the ratio of the sum of read counts of this subset to that of the whole data frame. E.g., get the proportion of the sum of reads of sequences with "Read.count" <= 100 to the overall number of reads:
```{r eval=TRUE,echo=TRUE}
# What is a proportion of sequences which
# have 'Read.count' <= 100 to the
tailbound.proportion(twb, 100) # overall number of reads?
```
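Both ratios can be reproduced by hand on a toy vector. This is a sketch that follows the descriptions in the text (ratios of read sums); the counts are made up, and the package functions may differ in detail.

```r
# Toy read counts, sorted in decreasing order.
read.count <- c(500, 300, 120, 60, 15, 3, 1, 1)

# Share of all reads carried by the top-2 clonotypes.
top.n <- 2
top.share <- sum(head(read.count, top.n)) / sum(read.count)

# Share of all reads carried by clonotypes with at most `bound` reads.
bound <- 100
tail.share <- sum(read.count[read.count <= bound]) / sum(read.count)

c(top = top.share, tail = tail.share)
```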
### Clonal space homeostasis
Clonal space homeostasis is a useful statistic showing how much of the repertoire is occupied by clonotypes within specific proportion ranges.
```{r eval=TRUE, echo=TRUE, fig.height=4, fig.width=6.5, fig.align='center'}
# data(twb)
# Compute summary space of clones, that occupy
# [0, .05) and [.05, 1] proportion.
clonal.space.homeostasis(twb, c(Low = .05, High = 1))
# Use default arguments:
clonal.space.homeostasis(twb[[1]])
twb.space <- clonal.space.homeostasis(twb)
vis.clonal.space(twb.space)
```
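A minimal base R sketch of the binning idea, using the same two intervals as the first call above. The read counts are hypothetical, and this is an illustration of the concept, not the package's code.

```r
# Toy read counts and the proportion of each clonotype.
read.count <- c(400, 300, 200, 60, 30, 10)
prop <- read.count / sum(read.count)

# Assign each clonotype to Low = [0, .05) or High = [.05, 1].
bins <- cut(prop, breaks = c(0, 0.05, 1), labels = c("Low", "High"),
            right = FALSE, include.lowest = TRUE)

# Space occupied by each bin: the summed proportions of its clonotypes.
space <- tapply(prop, bins, sum)
space
```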
### In-frame and out-of-frame sequences
The functions for subsetting and counting in-frame and out-of-frame clonotypes are `count.inframes`, `count.outframes`, `get.inframes` and `get.outframes`. The parameter *.head* of these functions is passed to the `head` function, which is applied to the input data frame or list of data frames before subsetting. The functions accept both data frames and lists of data frames. E.g., get a data frame with only in-frame sequences and count the out-of-frame sequences in the first 5000 rows:
```{r eval=TRUE,echo=TRUE}
imm.in <- get.inframes(twb) # Return all in-frame sequences from the 'twb'.
# Count the number of out-of-frame sequences
count.outframes(twb, 5000) # from the first 5000 sequences.
```
The general functions *get.frames* and *count.frames* take a parameter that stands for 'all' (all sequences), 'in' (only in-frame sequences) or 'out' (only out-of-frame sequences):
```{r eval=TRUE,echo=TRUE}
imm.in <- get.frames(twb, 'in') # Similar to 'get.inframes(twb)'.
count.frames(twb[[1]], 'all') # Just return number of rows.
flag <- 'out'
count.frames(twb, flag, 5000) # Similar to 'count.outframes(twb, 5000)'.
```
### Search for target CDR3 sequences
For exact or fuzzy search of sequences the package provides the function `find.clonotypes`. Its input arguments are: a data frame or a list of data frames; the targets (a character vector, or a data frame with one column of sequences and additional columns with, e.g., V genes); the column or columns to return; the method used to compare sequences ("exact" for exact matching, "hamm" for matching by Hamming distance, where two sequences match if H $\leq$ 1, or "lev" for matching by Levenshtein distance, where two sequences match if L $\leq$ 1); and the name of the column from which the sequences for matching are taken. This sounds complex, but in practice it is easy, so let's go to the examples.
Suppose we want to search for some CDR3 sequences in a number of repertoires:
```{r eval=TRUE,echo=TRUE}
cmv <- data.frame(CDR3.amino.acid.sequence = c('CASSSANYGYTF', 'CSVGRAQNEQFF', 'CASSLTGNTEAFF', 'CASSALGGAGTGELFF', 'CASSLIGVSSYNEQFF'),
V.genes = c('TRBV4-1', 'TRBV4-1', 'TRBV4-1', 'TRBV4-1', 'TRBV4-1'), stringsAsFactors = F)
cmv
```
We will search for them using all matching methods (exact, Hamming or Levenshtein), with and without matching by V-segment. For the first case (exact matching, without V gene) we return the "Total.insertions" column along with the "Read.count" column; for the second case the output is "Rank", the rank (generated by `set.rank`) of a clone or clonotype in a data frame.
```{r eval=TRUE,echo=TRUE}
twb <- set.rank(twb)
# Case 1.
cmv.imm.ex <-
find.clonotypes(.data = twb[1:2], .targets = cmv[,1], .method = 'exact',
.col.name = c('Read.count', 'Total.insertions'),
.verbose = F)
head(cmv.imm.ex)
# Case 2.
# Search for CDR3 sequences with hamming distance <= 1
# to the one of the cmv$CDR3.amino.acid.sequence with
# matching V genes. Return ranks of found sequences.
cmv.imm.hamm.v <-
find.clonotypes(twb[1:3], cmv, 'hamm', 'Rank',
.target.col = c('CDR3.amino.acid.sequence',
'V.gene'),
.verbose = F)
head(cmv.imm.hamm.v)
# Case 3.
# Similar to the previous example, except
# using levenshtein distance and the "Read.count" column.
cmv.imm.lev.v <-
find.clonotypes(twb[1:3], cmv, 'lev',
.target.col = c('CDR3.amino.acid.sequence', 'V.gene'),
.verbose = F)
head(cmv.imm.lev.v)
```
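To make the two fuzzy criteria concrete, the sketch below computes both distances in base R. The `hamming` helper and the candidate sequences are hypothetical; treating sequences of unequal length as non-matching under Hamming distance is an assumption of this sketch, not a statement about `find.clonotypes`.

```r
# Hypothetical helper: Hamming distance, defined only for equal lengths.
hamming <- function(a, b) {
  if (nchar(a) != nchar(b)) return(NA_integer_)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

target <- "CASSLTGNTEAFF"
candidates <- c("CASSLTGNTEAFF",  # identical
                "CASSLSGNTEAFF",  # one substitution
                "CASSLTGNTEAF",   # one deletion
                "CSVGRAQNEQFF")   # unrelated

ham <- vapply(candidates, hamming, integer(1), b = target)
lev <- as.vector(utils::adist(candidates, target))  # Levenshtein distance

# A candidate matches if its distance to the target is at most 1.
ham.match <- !is.na(ham) & ham <= 1
lev.match <- lev <= 1
data.frame(candidates, ham, lev, ham.match, lev.match, row.names = NULL)
```

The deletion case shows the practical difference: it is one edit away from the target, but has no Hamming distance because the lengths differ.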
## Gene usage
Variable and Joining gene usage (V-usage and J-usage) are important characteristics of repertoires. To access and compare them among repertoires *tcR* provides a few useful functions.
### Gene usage computing
To access the V- and J-usage of a repertoire, *tcR* provides the function `geneUsage`. Depending on its parameters, `geneUsage` computes frequencies or counts of the given elements (e.g., V genes) of the input data frame or list of data frames. V and J gene names for human TCR and Ig are stored in the file `genesegments.rda`; they are identical to those from IMGT: [link to beta genes (red ones)](http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=nomenclatures&species=human&group=TRBV) and [link to alpha genes (red ones)](http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=nomenclatures&species=human&group=TRAV). `geneUsage` accepts a data frame as well as a list of data frames. Its output is a data frame in which the first column holds the gene names and the other columns hold the frequencies or counts.
```{r eval=TRUE,echo=TRUE}
imm1.vs <- geneUsage(twb[[1]], HUMAN_TRBV)
head(imm1.vs)
imm.vs.all <- geneUsage(twb, HUMAN_TRBV)
imm.vs.all[1:10, 1:4]
# Compute joint V-J counts
imm1.vj <- geneUsage(twb[[1]], list(HUMAN_TRBV, HUMAN_TRBJ))
imm1.vj[1:5, 1:5]
```
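Conceptually, gene usage is a tabulation of a gene column against a fixed alphabet, so that genes absent from a sample still appear with a zero count. A base R sketch with a made-up four-gene alphabet (the real alphabets such as `HUMAN_TRBV` are much longer):

```r
# Hypothetical gene columns of a small cloneset.
v.gene <- c("TRBV4-1", "TRBV5-1", "TRBV4-1", "TRBV9")
j.gene <- c("TRBJ1-1", "TRBJ2-7", "TRBJ2-7", "TRBJ1-1")

# Hypothetical, shortened alphabets.
v.alphabet <- c("TRBV4-1", "TRBV5-1", "TRBV7-2", "TRBV9")
j.alphabet <- c("TRBJ1-1", "TRBJ2-7")

# Counts and frequencies per V gene; unused genes are kept with count 0.
v.counts <- table(factor(v.gene, levels = v.alphabet))
v.usage <- data.frame(Gene = names(v.counts),
                      Count = as.vector(v.counts),
                      Freq = as.vector(v.counts) / sum(v.counts),
                      stringsAsFactors = FALSE)

# Joint V-J counts as a matrix with V genes in rows and J genes in columns.
vj.counts <- table(factor(v.gene, levels = v.alphabet),
                   factor(j.gene, levels = j.alphabet))
v.usage
vj.counts
```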
You can also directly visualise gene usage with the function `vis.gene.usage` (if you pass the gene alphabet as a second argument):
```{r eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7}
# Put ".dodge = F" to get distinct plot for every data frame in the given list.
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'twb J-usage dodge', .dodge = T)
```
```{r eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=6, fig.width=9}
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'twb J-usage column', .dodge = F, .ncol = 2)
```
```{r eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7}
vis.gene.usage(imm1.vs, NA, .main = 'twb[[1]] V-usage', .coord.flip = F)
```
### Gene usage comparing
To evaluate V- and J genes usage of repertoires, the package implements subroutines for two approaches to the analysis: measures from the information theory and PCA (Principal Component Analysis).
#### Shannon entropy and Jensen-Shannon divergence
To assess the diversity of gene usage, use the `entropy` function. The asymmetric Kullback-Leibler divergence (function `kl.div`) and the symmetric Jensen-Shannon divergence (`js.div` for computing the JS-divergence between given distributions, and `js.div.seg` for computing it between the gene distributions of two clonesets or of a list of data frames) are provided to estimate the distance between the gene usages of different repertoires. To visualise distances, *tcR* provides the `vis.radarlike` function; see Section "Plots" for more detailed information.
```{r eval=T, echo=TRUE, fig.align='center'}
# Transform "0:100" to distribution with Laplace correction
entropy(0:100, .laplace = 1) # (i.e., add "1" to every value before transformation).
entropy.seg(twb, HUMAN_TRBV) # Compute entropy of V-segment usage for each data frame.
js.div.seg(twb[1:2], HUMAN_TRBV, .verbose = F)
imm.js <- js.div.seg(twb, HUMAN_TRBV, .verbose = F)
vis.radarlike(imm.js, .ncol = 2)
```
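For reference, the textbook definitions of these measures fit in a few lines of base R. The helper names (`laplace`, `entropy2`, `kl`, `js`) are hypothetical, logarithms are taken to base 2 as an assumption, and the package functions may use a different base or scaling.

```r
# Turn counts into a distribution, adding a pseudo-count to every value
# (Laplace correction).
laplace <- function(counts, k = 1) (counts + k) / sum(counts + k)

# Shannon entropy in bits; zero-probability terms contribute nothing.
entropy2 <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Kullback-Leibler divergence KL(p || q), asymmetric in its arguments.
kl <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon divergence: symmetrised KL against the mixture m.
js <- function(p, q) {
  m <- (p + q) / 2
  (kl(p, m) + kl(q, m)) / 2
}

p <- laplace(c(10, 5, 0, 1))
q <- laplace(c(2, 8, 4, 2))
c(entropy.p = entropy2(p), kl.pq = kl(p, q), kl.qp = kl(q, p), js = js(p, q))
```

With base-2 logarithms this JS-divergence lies between 0 (identical distributions) and 1 (distributions with disjoint support).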
#### Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that transforms a set of possibly correlated variables into a set of linearly uncorrelated variables (principal components), ordered by the variance they explain. *tcR* implements the functions `pca.segments` for performing PCA on V- or J-usage, and `pca.segments.2D` for performing PCA on VJ-usage. For plotting the PCA results see the `vis.pca` function.
```{r eval=TRUE, echo=TRUE, fig.align='center', fig.height=4.5, fig.width=6}
pca.segments(twb, .genes = HUMAN_TRBV) # Plot PCA results of V-segment usage.
# Return object of class "prcomp"
class(pca.segments(twb, .do.plot = F, .genes = HUMAN_TRBV))
```
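Since the result is a "prcomp" object, the underlying step can be sketched with `stats::prcomp` on a gene-usage matrix with one row per repertoire. The matrix here is random and purely illustrative; how the package scales or centres the data is not assumed.

```r
set.seed(1)

# Hypothetical usage matrix: 4 repertoires (rows) x 6 genes (columns).
usage <- matrix(runif(4 * 6), nrow = 4,
                dimnames = list(paste0("Subj.", LETTERS[1:4]),
                                paste0("Gene", 1:6)))
usage <- usage / rowSums(usage)  # each row becomes a frequency distribution

# PCA on the repertoires; columns are scaled to unit variance.
pca <- prcomp(usage, scale. = TRUE)

# Coordinates of each repertoire on the first two principal components.
pca$x[, 1:2]
summary(pca)$importance["Proportion of Variance", ]
```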
## Repertoire overlap analysis
*tcR* provides a number of functions for evaluating the similarity of clonesets based on the clonotypes they share, and for working with data frames of shared clonotypes.
### Overlap quantification
The general interface to all functions for computing cloneset overlap coefficients is the `repOverlap` function.
#### Number of shared clonotypes
The most straightforward, yet quite effective, way to evaluate the similarity of two clonesets is to compute the number of shared clonotypes. *tcR* provides the function `intersectClonesets` (`repOverlap(your_data, 'exact')`), which by default computes the number of shared clonotypes using the "CDR3.nucleotide.sequence" columns of the given data frames; the target columns can be changed with the arguments *.type* or *.col*. As in `find.clonotypes`, you can choose the matching method: exact match, match by Hamming distance or match by Levenshtein distance. The logical argument *.norm* normalises the number of shared clonotypes by dividing it by the product of the clonesets' sizes (**strongly** recommended, otherwise your results will correlate with the clonesets' sizes).
```{r eval=TRUE, echo=T, fig.align='center', warning=FALSE}
# Equivalent to intersect(twb[[1]]$CDR3.nucleotide.sequence,
# twb[[2]]$CDR3.nucleotide.sequence)
repOverlap(twb[1:2], 'exact', 'nuc', .verbose = F)
# Equivalent to intersectClonesets(twb, "n0e", .norm = T)
repOverlap(twb, 'exact', 'nuc', .norm = T, .verbose = F)
# Intersect by amino acid clonotypes + V genes
repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F)
# Plot a heatmap of the number of shared clonotypes.
vis.heatmap(repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F), .title = 'twb - (ave)-intersection', .labs = '')
```
See the `vis.heatmap` function in the Section "Visualisation" for the visualisation of the intersection results.
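The raw and normalised counts described above can be written out by hand for two toy vectors of sequences. The sequences are hypothetical; the normalisation is the one stated in the text (division by the product of the clonesets' sizes).

```r
# Hypothetical CDR3 nucleotide sequences of two small clonesets.
a <- c("TGTGCC", "TGTAGC", "TGTGGG", "TGTAAA")
b <- c("TGTGCC", "TGTAGC", "TGTCCC")

# Number of shared clonotypes by exact match.
shared <- length(intersect(a, b))

# Normalised value: shared count divided by the product of the sizes.
shared.norm <- shared / (length(a) * length(b))

c(shared = shared, normalised = shared.norm)
```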
The functions `intersectCount`, `intersectLogic` and `intersectIndices` are more flexible in terms of choosing which columns to match. They all have the parameter *.col*, which specifies the names of the columns used in computing the intersection. `intersectCount` returns the number of similar elements. `intersectIndices(x, y)` returns a 2-column matrix in which the first column is the index of an element in *x* and the second column is the index of the element of *y* that is similar to that element of *x*. `intersectLogic(x, y)` returns a logical vector of *length(x)* or *nrow(x)*, where TRUE at position *i* means that the element with index *i* has been found in *y*.
```{r eval=TRUE, echo=TRUE}
# Get a logical vector of shared elements, where the elements are
# tuples of CDR3 amino acid sequence and corresponding V-segment.
imm.1.2 <- intersectLogic(twb[[1]], twb[[2]],
.col = c('CDR3.amino.acid.sequence', 'V.gene'))
# Get elements which are in both twb[[1]] and twb[[2]].
head(twb[[1]][imm.1.2, c('CDR3.amino.acid.sequence', 'V.gene')])
```
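For exact matching, the three return shapes described above can be mimicked in base R by pasting the chosen columns into one key per row. The `key` helper and the two toy data frames are hypothetical; this sketch does not cover the Hamming or Levenshtein variants.

```r
x <- data.frame(CDR3.amino.acid.sequence = c("CASSF", "CASSL", "CASSF"),
                V.gene = c("TRBV4-1", "TRBV4-1", "TRBV5-1"),
                stringsAsFactors = FALSE)
y <- data.frame(CDR3.amino.acid.sequence = c("CASSQ", "CASSF"),
                V.gene = c("TRBV5-1", "TRBV4-1"),
                stringsAsFactors = FALSE)
cols <- c("CDR3.amino.acid.sequence", "V.gene")

# Hypothetical helper: one string key per row built from the chosen columns.
key <- function(df, cols) do.call(paste, c(unname(as.list(df[cols])), sep = "|"))

in.y <- key(x, cols) %in% key(y, cols)     # logical vector of length nrow(x)
idx <- match(key(x, cols), key(y, cols))   # position in y, NA if absent
pairs <- cbind(x = which(in.y), y = idx[in.y])  # 2-column index matrix
n.shared <- sum(in.y)                      # number of matched elements

in.y
pairs
```

Only the first row of `x` matches: the third row has the same CDR3 but a different V gene.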
#### "Top cross"
The number of shared clonotypes among the most abundant clonotypes may differ significantly from that among clonotypes with lower counts. To support such analysis, *tcR* offers the `top.cross` function, which applies `tcR::intersectClonesets` to, e.g., the first 1000 clonotypes, then the first 2000, 3000 and so on up to the first 100000 clonotypes, if supplied with `.n = seq(1000, 100000, 1000)`.
```{r eval=TRUE, echo=T, fig.align='center', fig.height=6.5, fig.width=10, warning=FALSE}
twb.top <- top.cross(.data = twb, .n = seq(500, 10000, 500), .verbose = F, .norm = T)
top.cross.plot(twb.top)
```
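The sequential intersection can be sketched for a single pair of repertoires in base R. The two ranked vectors are simulated from a shared pool of made-up identifiers, and the normalisation by the product of the subset sizes follows the *.norm* description given earlier.

```r
set.seed(42)

# Two hypothetical repertoires: ranked clonotype identifiers from one pool.
pool <- paste0("clone", 1:300)
a <- sample(pool, 200)
b <- sample(pool, 200)

# Intersect the top-n clonotypes of both repertoires for growing n.
n.top <- seq(50, 200, 50)
shared.top <- sapply(n.top, function(n) length(intersect(head(a, n), head(b, n))))
names(shared.top) <- n.top

# Normalise by the product of the two subset sizes (n * n).
shared.norm <- shared.top / n.top^2

rbind(shared = shared.top, normalised = shared.norm)
```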
#### More complex similarity measures
*tcR* also provides more complex similarity measures for evaluating the similarity of sets.
- Tversky index (`repOverlap(your_data, 'tversky')` for clonesets or `tversky.index` for vectors) is an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it's similar to Dice's coefficient.
- Overlap coefficient (`repOverlap(your_data, 'overlap')` for clonesets or `overlap.coef` for vectors) is a similarity measure that measures the overlap between two sets, and is defined as the size of the intersection divided by the smaller of the size of the two sets.
- Jaccard index (`repOverlap(your_data, 'jaccard')` for clonesets or `jaccard.index` for vectors) is a statistic used for comparing the similarity and diversity of sample sets.
- Morisita's overlap index (`repOverlap(your_data, 'morisita')` for clonesets or `morisitas.index` for other data) is a statistical measure of dispersion of individuals in a population and is used to compare overlap among samples. The formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas) (Morisita, 1959).
To visualise similarity among repertoires the `vis.heatmap` function is appropriate.
```{r eval=TRUE, echo=TRUE, results='hold'}
# Apply Morisita's overlap index to each pair of repertoires.
# Use information about V genes (i.e. one CDR3 clonotype is equal to another
# if and only if their CDR3 aa sequences are equal and their V genes are equal)
repOverlap(twb, 'morisita', 'aa', 'read.count', .vgene = T, .verbose = F)
```
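The set-based indices listed above have short textbook forms, sketched here in base R on toy data. The function names are hypothetical stand-ins (not the package's `tversky.index`, `overlap.coef`, `jaccard.index` or `morisitas.index`), and the Morisita formula shown is the standard one for count vectors over the same ordered set of clonotypes.

```r
# Set-based indices on two hypothetical sets of clonotypes.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
overlap <- function(a, b) length(intersect(a, b)) / min(length(a), length(b))
tversky <- function(a, b, alpha = 0.5, beta = 0.5) {
  inter <- length(intersect(a, b))
  inter / (inter + alpha * length(setdiff(a, b)) + beta * length(setdiff(b, a)))
}

# Morisita's overlap index on counts of the same clonotypes in two samples.
morisita <- function(x, y) {
  X <- sum(x); Y <- sum(y)
  dx <- sum(x * (x - 1)) / (X * (X - 1))
  dy <- sum(y * (y - 1)) / (Y * (Y - 1))
  2 * sum(x * y) / ((dx + dy) * X * Y)
}

a <- c("A", "B", "C", "D")
b <- c("C", "D", "E")
c(jaccard = jaccard(a, b), overlap = overlap(a, b),
  tversky = tversky(a, b), morisita = morisita(c(5, 3, 2, 0), c(4, 0, 3, 6)))
```

With `alpha = beta = 0.5` the Tversky index equals Dice's coefficient (here 4/7), and with `alpha = beta = 1` it equals the Jaccard index.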
### Overlap statistics and tests
#### Overlap Z-score (OZ-score) - a measure for "abnormality" in overlaps
`ozScore`
#### Monte Carlo permutation test for pairwise and one-vs-all-wise within- and inter-group differences in a set of repertoires
`permutDistTest`
`pca2euclid`
### Shared repertoire
To investigate the clonotypes shared among several repertoires (the "shared repertoire"), the package provides the `shared.repertoire` function along with functions for computing shared repertoire statistics. The `shared.representation` function computes the number of shared clonotypes for each repertoire and each degree of sharing (i.e., the number of people in which the indicated number of clones has been found). The function `shared.summary` is equivalent to `repOverlap(, 'exact')` but applies to the shared repertoire data frame. Measuring distances among repertoires using the cosine similarity on vectors of counts of shared sequences is also possible with the `cosine.sharing` function.
```{r eval=TRUE, echo=TRUE}
# Compute shared repertoire of amino acid CDR3 sequences and V genes
# which have been found in two or more people and return the Read.count column
# of such clonotypes from each data frame in the input list.
imm.shared <- shared.repertoire(.data = twb, .type = 'avrc', .min.ppl = 2, .verbose = F)
head(imm.shared)
shared.representation(imm.shared) # Number of shared sequences.
```
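The idea of a shared repertoire can be sketched in base R on toy data. The sample names and sequences below are invented for illustration, and the code only mimics the concept (sequences found in at least two samples, plus a presence table); it does not reproduce the output format of `shared.repertoire`.

```r
# Toy samples: each element is a vector of CDR3 amino acid sequences.
samples <- list(
  A = c("CASSLG", "CASSPG", "CASSQE"),
  B = c("CASSLG", "CASSQE", "CASRRR"),
  C = c("CASSLG", "CAWWWW")
)

# Count in how many samples each distinct sequence occurs.
n_people <- table(unlist(lapply(samples, unique)))

# Keep sequences found in two or more samples.
shared <- names(n_people)[n_people >= 2]

# Presence/absence matrix: rows are shared sequences, columns are samples.
presence <- sapply(samples, function(s) shared %in% s)
rownames(presence) <- shared

# Number of shared sequences per degree of sharing.
sharing_degree <- table(as.vector(n_people[n_people >= 2]))

presence
sharing_degree
```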
## Diversity evaluation
For assessing the distribution of clonotypes in a given repertoire, *tcR* provides functions for evaluating the diversity (`diversity` and `inverse.simpson`) and the skewness of the clonal distribution (`gini` and `gini.simpson`), as well as `repDiversity`, a general interface to all of these functions, which users should prefer for estimating the diversity of clonesets.

- `diversity` (`repDiversity(your_clonesets, "div")`) computes the ecological diversity index (the parameter `.q` sets the penalty for clones with large counts).
- `inverse.simpson` (`repDiversity(your_clonesets, "inv.simp")`) computes the inverse Simpson index (i.e., the inverse probability of choosing two similar clonotypes).
- `gini` (`repDiversity(your_clonesets, "gini")`) computes the economic Gini index of the clonal distribution.
- `gini.simpson` (`repDiversity(your_clonesets, "gini.simp")`) computes the Gini-Simpson index.
- `chao1` (`repDiversity(your_clonesets, "chao1")`) computes the Chao1 index, its SD and the two bounds of the 95% CI.

`repDiversity` accepts single clonesets as well as lists of clonesets. The parameter `.quant` specifies which column to use for computing the diversity (run `?repDiversity` for more information about input arguments).

```{r eval=TRUE, echo=TRUE, results='hold'}
# Evaluate the diversity of clones by the ecological diversity index.
repDiversity(twb, 'div', 'read.count')
sapply(twb, function (x) diversity(x$Read.count))
```
```{r eval=TRUE, echo=TRUE, results='hold'}
# Compute the diversity as the inverse probability of choosing two similar clonotypes.
repDiversity(twb, 'inv.simp', 'read.prop')
sapply(twb, function (x) inverse.simpson(x$Read.proportion))
```
```{r eval=TRUE, echo=TRUE, results='hold'}
# Evaluate the skewness of clonal distribution.
repDiversity(twb, 'gini.simp', 'read.prop')
sapply(twb, function (x) gini.simpson(x$Read.proportion))
```
```{r eval=TRUE, echo=TRUE, results='hold'}
# Compute diversity of repertoire using Chao index.
repDiversity(twb, 'chao1', 'read.count')
sapply(twb, function (x) chao1(x$Read.count))
```
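For readers who want to see what these indices compute, the following base-R sketch implements textbook versions on a plain vector of counts. All function names ending in `_sketch` are hypothetical. The Hill-number form of the ecological diversity index and the bias-corrected form of Chao1 are assumptions; the *tcR* functions may use other variants.

```r
# Hypothetical helpers for illustration only; they are not tcR functions.

# Inverse Simpson index: 1 / sum(p^2).
inv_simpson_sketch <- function(counts) {
  p <- counts / sum(counts)
  1 / sum(p^2)
}

# Gini-Simpson index: 1 - sum(p^2).
gini_simpson_sketch <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Ecological diversity of order q (Hill number), q != 1 (assumed form).
hill_sketch <- function(counts, q) {
  p <- counts / sum(counts)
  sum(p^q)^(1 / (1 - q))
}

# Chao1 richness estimate, bias-corrected form (assumed variant):
# observed clonotypes plus a correction from singletons and doubletons.
chao1_sketch <- function(counts) {
  s_obs <- sum(counts > 0)
  f1 <- sum(counts == 1)
  f2 <- sum(counts == 2)
  s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
}

even <- c(5, 5, 5, 5)
inv_simpson_sketch(even)
gini_simpson_sketch(even)
hill_sketch(even, 2)
chao1_sketch(c(1, 1, 1, 2, 2, 5))
```

For a perfectly even sample the inverse Simpson index equals the number of clonotypes, and the Hill number of order 2 coincides with it.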
## Visualisation
### CDR3 length and read count distributions plot
Plots of the distribution of CDR3 nucleotide sequence lengths (function `vis.count.len`) and the histogram of counts (function `vis.number.count`). The input data is either a data frame or a list of data frames. Argument *.col* specifies the name of the column with clonotype counts. Argument *.ncol* specifies the number of columns in a plot with multiple distributions, i.e., when the input data is a list of data frames.
```{r eval=TRUE, echo=TRUE, fig.height=4, fig.width=5.5, fig.align='center'}
vis.count.len(twb[[1]], .name = "twb[[1]] CDR3 lengths",
.col = "Read.count")
```
```{r eval=TRUE, echo=TRUE, fig.height=4, fig.width=5.5, fig.align='center', warning=FALSE, message=FALSE}
# I comment this to avoid a strange bug in ggplot2. Will uncomment later.
# vis.number.count(twb[[1]], .name = "twb[[1]] count distribution")
```
### Top proportions bar plot
For the visualisation of the proportions of the most abundant clonotypes in a repertoire, *tcR* offers the `vis.top.proportions` function. As input the function receives either a data frame or a list of data frames (argument *.data*), an integer vector with the numbers of top clonotypes for which the proportions of counts are computed (argument *.head*), and the name of the column with clonotype counts (argument *.col*).
```{r echo=TRUE, eval=TRUE, fig.height=4, fig.width=5.5, message=FALSE, fig.align='center'}
vis.top.proportions(twb, c(10, 500, 3000, 10000), .col = "Read.count")
```
### Clonal space homeostasis bar plot
To visualise how much space is occupied by each group of clonotypes, where groups are defined by the clonotypes' proportions in the data, use the `vis.clonal.space` function. As input it receives the output of the `clonal.space.homeostasis` function.
```{r eval=TRUE, echo=TRUE, fig.height=4, fig.width=6.5, fig.align='center'}
twb.space <- clonal.space.homeostasis(twb)
vis.clonal.space(twb.space)
```
### Heat map
Pairwise distances or similarities of repertoires can be represented as square matrices, in which each row and column represents a cloneset and the value in cell (i, j) is the distance between the repertoires with indices i and j. One way to visualise such matrices is with "heatmaps". For plotting heatmaps, *tcR* implements the `vis.heatmap` function. By changing input arguments the user can change the axis labels, title and legend.
```{r eval=TRUE, echo=TRUE, fig.align='center', warning=FALSE, message=FALSE}
twb.shared <- repOverlap(twb, "exact", .norm = F, .verbose = F)
vis.heatmap(twb.shared, .title = "Twins shared nuc clonotypes",
.labs = c("Sample in x", "Sample in y"), .legend = "# clonotypes")
```
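The kind of matrix that a heatmap displays can be built by hand. The sketch below, on invented toy samples, counts shared sequences for every pair of samples in base R; it illustrates the structure only and is not the code behind `repOverlap`.

```r
# Toy samples: each element is a vector of CDR3 sequences.
samples <- list(
  A = c("CASSLG", "CASSPG", "CASSQE"),
  B = c("CASSLG", "CASSQE", "CASRRR"),
  C = c("CASSLG", "CAWWWW")
)

n <- length(samples)
# Square matrix with one row and one column per sample;
# the diagonal is left as NA because a sample is not compared with itself.
overlap <- matrix(NA_integer_, n, n,
                  dimnames = list(names(samples), names(samples)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {
      overlap[i, j] <- length(intersect(samples[[i]], samples[[j]]))
    }
  }
}
overlap
```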
### Radar-like plot
Another way to represent distances among objects is with "radar-like" plots (so called because they are not exactly radar plots), implemented in *tcR* through the `vis.radarlike` function. Argument *.ncol* specifies the number of columns of radar-like plots in a viewport.
```{r eval=T, echo=TRUE, fig.align='center'}
twb.js <- js.div.seg(twb, HUMAN_TRBV, .verbose = F)
vis.radarlike(twb.js, .ncol = 2)
```
### Gene usage histogram
For the visualisation of gene usage, *tcR* provides the `vis.gene.usage` function, which draws classical histograms. The function accepts clonesets, lists of clonesets, or output from the `geneUsage` function. If the input is a cloneset or a list of clonesets, the user should specify a gene alphabet (e.g., `HUMAN_TRBV`) so that the gene usage can be computed. The parameter `.dodge` switches between a separate histogram for each cloneset in the input list (`.dodge = F`) and a single histogram for all data, which is very useful for comparing distributions of genes (`.dodge = T`). If `.dodge = F` and the input is a list of clonesets or the gene usage of several clonesets, the argument `.ncol` specifies how many columns of histograms are drawn. With `.coord.flip` the user can flip the coordinates so that genes are on the left side of the plot.
```{r eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7}
vis.gene.usage(twb[[1]], HUMAN_TRBV, .main = 'Sample I V-usage')
```
```{r eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=7, fig.width=5}
vis.gene.usage(twb[[2]], HUMAN_TRBV, .main = 'Sample II V-usage', .coord.flip = T)
```
```{r eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7}
twb.jusage <- geneUsage(twb, HUMAN_TRBJ)
vis.gene.usage(twb.jusage, .main = 'Twins J-usage', .dodge = T)
```
```{r eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=6, fig.width=9}
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'Twins J-usage', .dodge = F, .ncol = 2)
```
### PCA visualisation
For the visualisation of results from the `prcomp` function (i.e., objects of class `prcomp`), *tcR* provides the `vis.pca` function. Its input arguments are an object of class `prcomp` and, optionally, a list of groups (vectors of sample indices) for colouring points in the plot.
```{r eval=TRUE, echo=TRUE, fig.align='center', fig.height=4.5, fig.width=6}
twb.pca <- pca.segments(twb, .do.plot = F)
vis.pca(pca.segments(twb, .do.plot = F, .genes = HUMAN_TRBV), .groups = list(GroupA = c(1,2), GroupB = c(3,4)))
```
### Logo-like plot
Logo-like graphs for visualisation of nucleotide or amino acid motif sequences / profiles.
```{r eval=TRUE, echo=TRUE, fig.align='center', fig.width=6, fig.height=5.5, warning=FALSE, message=FALSE}
km <- get.kmers(twb[[1]]$CDR3.amino.acid.sequence, .head = 100, .k = 7, .verbose = F)
d <- kmer.profile(km)
vis.logo(d)
```
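A logo plot is drawn from a position-wise frequency profile. As a rough illustration of that idea, and without assuming anything about the exact output of `kmer.profile`, the following base-R sketch computes the proportion of each letter at each position for a few invented k-mers of equal length.

```r
# Toy k-mers of equal length (invented for illustration).
kmers <- c("CASSL", "CASSP", "CAWSL")

# Character matrix: one row per k-mer, one column per position.
chars <- do.call(rbind, strsplit(kmers, ""))
alphabet <- sort(unique(as.vector(chars)))

# For every position, the proportion of each letter of the alphabet.
profile <- sapply(seq_len(ncol(chars)), function(j) {
  as.vector(table(factor(chars[, j], levels = alphabet))) / nrow(chars)
})
rownames(profile) <- alphabet
colnames(profile) <- seq_len(ncol(chars))
profile
```

Each column of the resulting matrix sums to one, so it can be read as the letter distribution at that position.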
## Mutation networks
A mutation network (or mutation graph) is a graph whose vertices represent nucleotide or in-frame amino acid sequences (out-of-frame amino acid sequences are automatically filtered out by the *tcR* functions that create mutation networks) and whose edges connect pairs of sequences with a Hamming distance (parameter *.method* = 'hamm') or edit distance (parameter *.method* = 'lev') no greater than the value of the *.max.errors* parameter of the `mutation.network` function. To create a mutation network, first build a shared repertoire and then apply the `mutation.network` function to it:
```{r eval=TRUE, echo=TRUE}
# data(twb)
twb.shared <- shared.repertoire(twb, .head = 1000, .verbose = F)
G <- mutation.network(twb.shared)
G
```
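The edge rule described above can be sketched in base R without any graph library. The helper `hamming_sketch`, the toy sequences and the threshold `max_errors` are hypothetical and stand in for the *.max.errors* idea; `mutation.network` itself returns a graph object and is not reproduced here. The base-R function `adist` is used for the edit (Levenshtein) distance.

```r
# Hypothetical helper: Hamming distance between two strings;
# strings of different length are treated as infinitely distant.
hamming_sketch <- function(a, b) {
  if (nchar(a) != nchar(b)) return(Inf)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

seqs <- c("CASSLG", "CASSLA", "CASSPA", "CAWWWWW")
max_errors <- 1

# Pairwise Hamming distances between all sequences.
d <- outer(seqs, seqs, Vectorize(hamming_sketch))
dimnames(d) <- list(seqs, seqs)

# Adjacency matrix: connect distinct sequences within the threshold.
adj <- d <= max_errors
diag(adj) <- FALSE

# Edge list taken from the upper triangle (each edge listed once).
edges <- which(adj & upper.tri(adj), arr.ind = TRUE)
edge_list <- data.frame(from = seqs[edges[, "row"]],
                        to = seqs[edges[, "col"]])
edge_list

# Edit distances for the 'lev'-style rule, via base R.
lev <- adist(seqs)
```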
To manipulate vertex attributes, the functions `set.group.vector` and `get.group.names` are provided.
```{r eval=TRUE, echo=TRUE}
# data(twb)
# twb.shared <- shared.repertoire(twb, .head = 1000)
# G <- mutation.network(twb.shared)
G <- set.group.vector(G, "twins", list(A = c(1,2), B = c(3,4))) # <= refactor this
get.group.names(G, "twins", 1)
get.group.names(G, "twins", 300)
get.group.names(G, "twins", c(1,2,3), F)
get.group.names(G, "twins", 300, F)
# Because we have only two groups, we can assign more readable attribute.
V(G)$twin.names <- get.group.names(G, "twins")
V(G)$twin.names[1]
V(G)$twin.names[300]
```
To access the neighbour vertices of given vertices (their "ego-network"), use the `mutated.neighbours` function:
```{r eval=TRUE, echo=TRUE}
# data(twb)
# twb.shared <- shared.repertoire(twb, .head = 1000)
# G <- mutation.network(twb.shared)
head(mutated.neighbours(G, 1)[[1]])
```
## Conclusion
Feel free to contact me with questions about the package or about immunoinformatics research.
If you spot a bug or would like to see something useful added to the package, feel free to raise an issue at the *tcR* GitHub: [Issues](https://github.com/imminfo/tcr/issues)
## Appendix A: Kmers manipulation and processing
The package implements functions for working with k-mers. Function `get.kmers` generates k-mers from a given character vector or from a data frame with columns for sequences and a count for each sequence.
```{r eval=TRUE, echo=TRUE}
head(get.kmers(twb[[1]]$CDR3.amino.acid.sequence, 100, .meat = F, .verbose = F))
head(get.kmers(twb[[1]], .meat = T, .verbose = F))
```
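What k-mer generation amounts to can be shown in a few lines of base R. The helper `kmers_sketch` is hypothetical and simply counts every substring of length `k`; it ignores per-sequence counts and the other options of `get.kmers`.

```r
# Hypothetical helper: count all substrings of length k
# across a character vector of sequences.
kmers_sketch <- function(seqs, k) {
  # Sequences shorter than k contain no k-mers.
  seqs <- seqs[nchar(seqs) >= k]
  km <- unlist(lapply(seqs, function(s) {
    starts <- seq_len(nchar(s) - k + 1)
    substring(s, starts, starts + k - 1)
  }))
  sort(table(km), decreasing = TRUE)
}

res <- kmers_sketch(c("CASSL", "CASSP", "CA"), 3)
res
```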
## Appendix B: Nucleotide and amino acid sequences manipulation
The package also provides several functions for performing classic bioinformatics tasks on strings. For more powerful subroutines see Bioconductor's *Biostrings* package.
### Nucleotide sequence manipulation
Functions for basic nucleotide sequence manipulation: reverse complement, translation and GC-content computation. All functions are vectorised.
```{r eval=TRUE, echo=TRUE}
revcomp(c('AAATTT', 'ACGTTTGGA'))
cbind(bunch.translate(twb[[1]]$CDR3.nucleotide.sequence[1:10]),
twb[[1]]$CDR3.amino.acid.sequence[1:10])
gc.content(twb[[1]]$CDR3.nucleotide.sequence[1:10])
```
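As a point of reference, reverse complement and GC content are easy to express in base R. The helpers `revcomp_sketch` and `gc_sketch` below are hypothetical, handle only upper-case A, C, G and T, and are not the package's implementations.

```r
# Hypothetical helper: reverse complement of each sequence.
revcomp_sketch <- function(seqs) {
  vapply(seqs, function(s) {
    complemented <- chartr("ACGT", "TGCA", s)
    paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: fraction of G and C letters in each sequence.
gc_sketch <- function(seqs) {
  vapply(strsplit(seqs, ""), function(ch) mean(ch %in% c("G", "C")),
         numeric(1))
}

rc <- revcomp_sketch(c("AAATTT", "ACGTTTGGA"))
gc <- gc_sketch(c("AAATTT", "ACGTTTGGA"))
rc
gc
```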
### Reverse translation subroutines
Function `codon.variants` returns, for each input amino acid sequence, a list of vectors of nucleotide codons for each letter. Function `translated.nucl.sequences` returns the number of nucleotide sequences which, when translated, result in the given amino acid sequence(s). Function `reverse.translation` returns all nucleotide sequences which are translated to the given amino acid sequences. The optional argument `.nucseq` of each of these functions restricts the nucleotides which cannot be changed. All functions are vectorised.
```{r eval=TRUE, echo=TRUE}
codon.variants('LQ')
translated.nucl.sequences(c('LQ', 'CASSLQ'))
reverse.translation('LQ')
translated.nucl.sequences('LQ', 'XXXXXG')
codon.variants('LQ', 'XXXXXG')
reverse.translation('LQ', 'XXXXXG')
```
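The counting behind reverse translation is a product of codon choices per amino acid. The sketch below uses a deliberately partial codon table (only the five amino acids in the examples) and a hypothetical helper `count_nucl_sketch`. Reading "X" in the restriction string as "any nucleotide" is an assumption based on the `'XXXXXG'` example above.

```r
# Partial standard codon table, limited to the amino acids used below.
codons <- list(
  L = c("TTA", "TTG", "CTT", "CTC", "CTA", "CTG"),
  Q = c("CAA", "CAG"),
  C = c("TGT", "TGC"),
  A = c("GCT", "GCC", "GCA", "GCG"),
  S = c("TCT", "TCC", "TCA", "TCG", "AGT", "AGC")
)

# Hypothetical helper: number of nucleotide sequences that translate
# to `aa`, optionally restricted by `nucseq` ("X" = any nucleotide).
count_nucl_sketch <- function(aa, nucseq = NULL) {
  aa_letters <- strsplit(aa, "")[[1]]
  n <- 1
  for (i in seq_along(aa_letters)) {
    variants <- codons[[aa_letters[i]]]
    if (!is.null(nucseq)) {
      # Restriction for the i-th codon, turned into a regular expression.
      mask <- substring(nucseq, 3 * i - 2, 3 * i)
      pattern <- paste0("^", gsub("X", ".", mask, fixed = TRUE), "$")
      variants <- variants[grepl(pattern, variants)]
    }
    n <- n * length(variants)
  }
  n
}

# All nucleotide sequences for "LQ": every pairing of an L and a Q codon.
grid <- expand.grid(codons[["L"]], codons[["Q"]], stringsAsFactors = FALSE)
all_lq <- paste0(grid[[1]], grid[[2]])

count_nucl_sketch("LQ")
count_nucl_sketch("CASSLQ")
count_nucl_sketch("LQ", "XXXXXG")
```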
tcR/inst/doc/tcrvignette.R 0000644 0001762 0000144 00000031730 13446161025 015213 0 ustar ligges users
## ----eval=TRUE,echo=FALSE,warning=FALSE,message=FALSE--------------------
library(tcR)
data(twa)
data(twb)
## ----eval=FALSE,echo=TRUE------------------------------------------------
# # Load the package and load the data.
# library(tcR)
# data(twa) # "twa" - list of length 4
# data(twb) # "twb" - list of length 4
#
# # Explore the data.
# head(twa[[1]])
# head(twb[[1]])
## ----eval=FALSE,echo=TRUE------------------------------------------------
# # Run help to see available alphabets.
# ?genealphabets
# ?genesegments
# data(genesegments)
## ----eval=FALSE,echo=TRUE------------------------------------------------
# # Parse file in "~/mitcr_data/immdata1.txt" as a MiTCR file.
# immdata1 <- parse.file("~/mitcr_data/immdata1.txt", 'mitcr')
# # equivalent to
# immdata1.eq <- parse.mitcr("~/mitcr_data/immdata1.txt")
#
# # Parse folder with MiGEC files.
# immdata <- parse.folder("~/migec_data/", 'migec')
## ----eval=TRUE, echo=TRUE------------------------------------------------
# No D genes are available here, hence "" in "D.gene" and "-1" at positions.
str(twa[[1]])
str(twb[[1]])
## ----eval=TRUE,echo=TRUE-------------------------------------------------
cloneset.stats(twb)
## ----eval=TRUE,echo=TRUE-------------------------------------------------
repseq.stats(twb)
## ----eval=TRUE,echo=TRUE-------------------------------------------------
# How many clonotypes fill up approximately
clonal.proportion(twb, 25) # the 25% of the sum of values in 'Read.count'?
## ----echo=TRUE, eval=TRUE, fig=TRUE, fig.height=4, fig.width=5.5, message=FALSE, fig.align='center'----
# What proportion of the overall number of reads
top.proportion(twb, 10) # is accounted for by the top-10 clonotypes' reads?
vis.top.proportions(twb) # Plot these proportions.
## ----eval=TRUE,echo=TRUE-------------------------------------------------
# What is a proportion of sequences which
# have 'Read.count' <= 100 to the
tailbound.proportion(twb, 100) # overall number of reads?
## ----eval=TRUE, echo=TRUE, fig.height=4, fig.width=6.5, fig.align='center'----
# data(twb)
# Compute summary space of clones, that occupy
# [0, .05) and [.05, 1] proportion.
clonal.space.homeostasis(twb, c(Low = .05, High = 1))
# Use default arguments:
clonal.space.homeostasis(twb[[1]])
twb.space <- clonal.space.homeostasis(twb)
vis.clonal.space(twb.space)
## ----eval=TRUE,echo=TRUE-------------------------------------------------
imm.in <- get.inframes(twb) # Return all in-frame sequences from the 'twb'.
# Count the number of out-of-frame sequences
count.outframes(twb, 5000) # from the first 5000 sequences.
## ----eval=TRUE,echo=TRUE-------------------------------------------------
imm.in <- get.frames(twb, 'in') # Similar to 'get.inframes(twb)'.
count.frames(twb[[1]], 'all') # Just return number of rows.
flag <- 'out'
count.frames(twb, flag, 5000) # Similar to 'count.outframes(twb, 5000)'.
## ----eval=TRUE,echo=TRUE-------------------------------------------------
cmv <- data.frame(CDR3.amino.acid.sequence = c('CASSSANYGYTF', 'CSVGRAQNEQFF', 'CASSLTGNTEAFF', 'CASSALGGAGTGELFF', 'CASSLIGVSSYNEQFF'),
V.genes = c('TRBV4-1', 'TRBV4-1', 'TRBV4-1', 'TRBV4-1', 'TRBV4-1'), stringsAsFactors = F)
cmv
## ----eval=TRUE,echo=TRUE-------------------------------------------------
twb <- set.rank(twb)
# Case 1.
cmv.imm.ex <-
find.clonotypes(.data = twb[1:2], .targets = cmv[,1], .method = 'exact',
.col.name = c('Read.count', 'Total.insertions'),
.verbose = F)
head(cmv.imm.ex)
# Case 2.
# Search for CDR3 sequences with hamming distance <= 1
# to the one of the cmv$CDR3.amino.acid.sequence with
# matching V genes. Return ranks of found sequences.
cmv.imm.hamm.v <-
find.clonotypes(twb[1:3], cmv, 'hamm', 'Rank',
.target.col = c('CDR3.amino.acid.sequence',
'V.gene'),
.verbose = F)
head(cmv.imm.hamm.v)
# Case 3.
# Similar to the previous example, except
# using levenshtein distance and the "Read.count" column.
cmv.imm.lev.v <-
find.clonotypes(twb[1:3], cmv, 'lev',
.target.col = c('CDR3.amino.acid.sequence', 'V.gene'),
.verbose = F)
head(cmv.imm.lev.v)
## ----eval=TRUE,echo=TRUE-------------------------------------------------
imm1.vs <- geneUsage(twb[[1]], HUMAN_TRBV)
head(imm1.vs)
imm.vs.all <- geneUsage(twb, HUMAN_TRBV)
imm.vs.all[1:10, 1:4]
# Compute joint V-J counts
imm1.vj <- geneUsage(twb[[1]], list(HUMAN_TRBV, HUMAN_TRBJ))
imm1.vj[1:5, 1:5]
## ----eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7----
# Put ".dodge = F" to get distinct plot for every data frame in the given list.
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'twb J-usage dodge', .dodge = T)
## ----eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=6, fig.width=9----
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'twb J-usage column', .dodge = F, .ncol = 2)
## ----eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7----
vis.gene.usage(imm1.vs, NA, .main = 'twb[[1]] V-usage', .coord.flip = F)
## ----eval=T, echo=TRUE, fig.align='center'-------------------------------
# Transform "0:100" to distribution with Laplace correction
entropy(0:100, .laplace = 1) # (i.e., add "1" to every value before transformation).
entropy.seg(twb, HUMAN_TRBV) # Compute entropy of V-segment usage for each data frame.
js.div.seg(twb[1:2], HUMAN_TRBV, .verbose = F)
imm.js <- js.div.seg(twb, HUMAN_TRBV, .verbose = F)
vis.radarlike(imm.js, .ncol = 2)
## ----eval=TRUE, echo=TRUE, fig.align='center', fig.height=4.5, fig.width=6----
pca.segments(twb, .genes = HUMAN_TRBV) # Plot PCA results of V-segment usage.
# Return object of class "prcomp"
class(pca.segments(twb, .do.plot = F, .genes = HUMAN_TRBV))
## ----eval=TRUE, echo=T, fig.align='center', warning=FALSE----------------
# Equivalent to intersect(twb[[1]]$CDR3.nucleotide.sequence,
# twb[[2]]$CDR3.nucleotide.sequence)
repOverlap(twb[1:2], 'exact', 'nuc', .verbose = F)
# Equivalent to intersectClonesets(twb, "n0e", .norm = T)
repOverlap(twb, 'exact', 'nuc', .norm = T, .verbose = F)
# Intersect by amino acid clonotypes + V genes
repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F)
# Plot a heatmap of the number of shared clonotypes.
vis.heatmap(repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F), .title = 'twb - (ave)-intersection', .labs = '')
## ----eval=TRUE, echo=TRUE------------------------------------------------
# Get logical vector of shared elements, where
# elements are tuples of CDR3 amino acid sequence and corresponding V-segment
imm.1.2 <- intersectLogic(twb[[1]], twb[[2]],
.col = c('CDR3.amino.acid.sequence', 'V.gene'))
# Get elements which are in both twb[[1]] and twb[[2]].
head(twb[[1]][imm.1.2, c('CDR3.amino.acid.sequence', 'V.gene')])
## ----eval=TRUE, echo=T, fig.align='center', fig.height=6.5, fig.width=10, warning=FALSE----
twb.top <- top.cross(.data = twb, .n = seq(500, 10000, 500), .verbose = F, .norm = T)
top.cross.plot(twb.top)
## ----eval=TRUE, echo=TRUE, results='hold'--------------------------------
# Apply Morisita's overlap index to each pair of repertoires.
# Use information about V genes (i.e. one CDR3 clonotype is equal to another
# if and only if their CDR3 aa sequences are equal and their V genes are equal)
repOverlap(twb, 'morisita', 'aa', 'read.count', .vgene = T, .verbose = F)
## ----eval=TRUE, echo=TRUE------------------------------------------------
# Compute shared repertoire of amino acid CDR3 sequences and V genes
# which have been found in two or more people and return the Read.count column
# of such clonotypes from each data frame in the input list.
imm.shared <- shared.repertoire(.data = twb, .type = 'avrc', .min.ppl = 2, .verbose = F)
head(imm.shared)
shared.representation(imm.shared) # Number of shared sequences.
## ----eval=TRUE, echo=TRUE, results='hold'--------------------------------
# Evaluate the diversity of clones by the ecological diversity index.
repDiversity(twb, 'div', 'read.count')
sapply(twb, function (x) diversity(x$Read.count))
## ----eval=TRUE, echo=TRUE, results='hold'--------------------------------
# Compute the diversity as the inverse probability of choosing two similar clonotypes.
repDiversity(twb, 'inv.simp', 'read.prop')
sapply(twb, function (x) inverse.simpson(x$Read.proportion))
## ----eval=TRUE, echo=TRUE, results='hold'--------------------------------
# Evaluate the skewness of clonal distribution.
repDiversity(twb, 'gini.simp', 'read.prop')
sapply(twb, function (x) gini.simpson(x$Read.proportion))
## ----eval=TRUE, echo=TRUE, results='hold'--------------------------------
# Compute diversity of repertoire using Chao index.
repDiversity(twb, 'chao1', 'read.count')
sapply(twb, function (x) chao1(x$Read.count))
## ----eval=TRUE, echo=TRUE, fig.height=4, fig.width=5.5, fig.align='center'----
vis.count.len(twb[[1]], .name = "twb[[1]] CDR3 lengths",
.col = "Read.count")
## ----eval=TRUE, echo=TRUE, fig.height=4, fig.width=5.5, fig.align='center', warning=FALSE, message=FALSE----
# I comment this to avoid a strange bug in ggplot2. Will uncomment later.
# vis.number.count(twb[[1]], .name = "twb[[1]] count distribution")
## ----echo=TRUE, eval=TRUE, fig.height=4, fig.width=5.5, message=FALSE, fig.align='center'----
vis.top.proportions(twb, c(10, 500, 3000, 10000), .col = "Read.count")
## ----eval=TRUE, echo=TRUE, fig.height=4, fig.width=6.5, fig.align='center'----
twb.space <- clonal.space.homeostasis(twb)
vis.clonal.space(twb.space)
## ----eval=TRUE, echo=TRUE, fig.align='center', warning=FALSE, message=FALSE----
twb.shared <- repOverlap(twb, "exact", .norm = F, .verbose = F)
vis.heatmap(twb.shared, .title = "Twins shared nuc clonotypes",
.labs = c("Sample in x", "Sample in y"), .legend = "# clonotypes")
## ----eval=T, echo=TRUE, fig.align='center'-------------------------------
twb.js <- js.div.seg(twb, HUMAN_TRBV, .verbose = F)
vis.radarlike(twb.js, .ncol = 2)
## ----eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7----
vis.gene.usage(twb[[1]], HUMAN_TRBV, .main = 'Sample I V-usage')
## ----eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=7, fig.width=5----
vis.gene.usage(twb[[2]], HUMAN_TRBV, .main = 'Sample II V-usage', .coord.flip = T)
## ----eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=5, fig.width=7----
twb.jusage <- geneUsage(twb, HUMAN_TRBJ)
vis.gene.usage(twb.jusage, .main = 'Twins J-usage', .dodge = T)
## ----eval=TRUE, echo=TRUE, message=FALSE, fig.align='center', fig.height=6, fig.width=9----
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'Twins J-usage', .dodge = F, .ncol = 2)
## ----eval=TRUE, echo=TRUE, fig.align='center', fig.height=4.5, fig.width=6----
twb.pca <- pca.segments(twb, .do.plot = F)
vis.pca(pca.segments(twb, .do.plot = F, .genes = HUMAN_TRBV), .groups = list(GroupA = c(1,2), GroupB = c(3,4)))
## ----eval=TRUE, echo=TRUE, fig.align='center', fig.width=6, fig.height=5.5, warning=FALSE, message=FALSE----
km <- get.kmers(twb[[1]]$CDR3.amino.acid.sequence, .head = 100, .k = 7, .verbose = F)
d <- kmer.profile(km)
vis.logo(d)
## ----eval=TRUE, echo=TRUE------------------------------------------------
# data(twb)
twb.shared <- shared.repertoire(twb, .head = 1000, .verbose = F)
G <- mutation.network(twb.shared)
G
## ----eval=TRUE, echo=TRUE------------------------------------------------
# data(twb)
# twb.shared <- shared.repertoire(twb, .head = 1000)
# G <- mutation.network(twb.shared)
G <- set.group.vector(G, "twins", list(A = c(1,2), B = c(3,4))) # <= refactor this
get.group.names(G, "twins", 1)
get.group.names(G, "twins", 300)
get.group.names(G, "twins", c(1,2,3), F)
get.group.names(G, "twins", 300, F)
# Because we have only two groups, we can assign more readable attribute.
V(G)$twin.names <- get.group.names(G, "twins")
V(G)$twin.names[1]
V(G)$twin.names[300]
## ----eval=TRUE, echo=TRUE------------------------------------------------
# data(twb)
# twb.shared <- shared.repertoire(twb, .head = 1000)
# G <- mutation.network(twb.shared)
head(mutated.neighbours(G, 1)[[1]])
## ----eval=TRUE, echo=TRUE------------------------------------------------
head(get.kmers(twb[[1]]$CDR3.amino.acid.sequence, 100, .meat = F, .verbose = F))
head(get.kmers(twb[[1]], .meat = T, .verbose = F))
## ----eval=TRUE, echo=TRUE------------------------------------------------
revcomp(c('AAATTT', 'ACGTTTGGA'))
cbind(bunch.translate(twb[[1]]$CDR3.nucleotide.sequence[1:10]),
twb[[1]]$CDR3.amino.acid.sequence[1:10])
gc.content(twb[[1]]$CDR3.nucleotide.sequence[1:10])
## ----eval=TRUE, echo=TRUE------------------------------------------------
codon.variants('LQ')
translated.nucl.sequences(c('LQ', 'CASSLQ'))
reverse.translation('LQ')
translated.nucl.sequences('LQ', 'XXXXXG')
codon.variants('LQ', 'XXXXXG')
reverse.translation('LQ', 'XXXXXG')
tcR/inst/doc/tcrvignette.html 0000644 0001762 0000144 00017345101 13446161025 015766 0 ustar ligges users
tcR: a package for T cell receptor and Immunoglobulin repertoires advanced data analysis
The tcR package is designed to help researchers in the immunology field analyse T cell receptor (TCR) and immunoglobulin (Ig) repertoires. In this vignette, I will cover procedures for immune receptor repertoire analysis provided with the package.
Terms:
Clonotype: a group of T / B cell clones with equal CDR3 nucleotide sequences and equal Variable genes.
Cloneset / repertoire: a set of clonotypes. Represented as a data frame in which each row corresponds to a unique clonotype.
UMI: Unique Molecular Identifier (see this paper for details)
Package features
Parsers for outputs of various tools for CDR3 extraction and genes alignment (currently implemented parsers for MiTCR, MiGEC, VDJtools, ImmunoSEQ, IMSEQ and MiXCR)
Data manipulation (in-frame / out-of-frame sequences subsetting, clonotype motif search)
Descriptive statistics (number of reads, number of clonotypes, gene segment usage)
Shared clonotypes statistics (number of shared clonotypes, using V genes or not; sequential intersection among the most abundant clonotypes (“top-cross”))
Artificial repertoire generation (beta chain only, for now)
Spectratyping
Various visualisation procedures
Mutation networks (graphs, in which vertices represent CDR3 nucleotide / amino acid sequences and edges are connecting similar sequences with low hamming or edit distance between them)
Data in the package
There are two datasets provided with the package - twins data and V(D)J recombination genes data.
Downsampled twins data
twa.rda, twb.rda - two lists with 4 data frames in each list. Every data frame is a sample downsampled to the 10000 most abundant clonotypes of twins data (alpha and beta chains). Full data is available here:
# Load the package and load the data.
library(tcR)
data(twa) # "twa" - list of length 4
data(twb) # "twb" - list of length 4
# Explore the data.
head(twa[[1]])
head(twb[[1]])
Gene alphabets
Gene alphabets - character vectors with names of genes for TCR and Ig.
# Run help to see available alphabets.
?genealphabets
?genesegments
data(genesegments)
Quick start / automatic report generation
For the exploratory analysis of a single repertoire, use the RMarkdown report file at
"<path to the tcR package>/inst/library.report.Rmd"
The analysis in the file includes statistics and visualisation of the number of clones, clonotypes, in- and out-of-frame sequences, unique amino acid CDR3 sequences, V- and J-usage, most frequent k-mers, and rarefaction analysis.
For the analysis of a group of repertoires (“cross-analysis”), use the RMarkdown report file at:
"<path to the tcR package>/inst/crossanalysis.report.Rmd"
The analysis in this file includes statistics and visualisation of the number of shared clones and clonotypes, V-usage for individuals and groups, J-usage for individuals, Jensen-Shannon divergence among V-usages of repertoires, and top-cross.
You need the knitr package installed in order to generate reports from default pipelines. In RStudio you can run a pipeline file as follows:
Run RStudio -> load the pipeline .Rmd files -> press the knitr button
Input parsing
Currently, tcR implements parsers for the following software:
MiTCR - parse.mitcr;
MiTCR w/ UMIs - parse.mitcrbc;
MiGEC - parse.migec;
VDJtools - parse.vdjtools;
ImmunoSEQ - parse.immunoseq;
MiXCR - parse.mixcr;
IMSEQ - parse.imseq.
A general parser, parse.cloneset, for text table files is also implemented. The general wrapper for parsers is parse.file. Users can also parse a list of files or an entire folder. Run ?parse.folder to see help on parsing input files and a list of functions for parsing a specific input format.
# Parse file in "~/mitcr_data/immdata1.txt" as a MiTCR file.
immdata1 <- parse.file("~/mitcr_data/immdata1.txt", 'mitcr')
# equivalent to
immdata1.eq <- parse.mitcr("~/mitcr_data/immdata1.txt")
# Parse folder with MiGEC files.
immdata <- parse.folder("~/migec_data/", 'migec')
Cloneset representation
Clonesets are represented in tcR as data frames in which each row corresponds to one nucleotide clonotype, with specific column names:
V.end - last positions of aligned V genes (1-based);
J.start - first positions of aligned J genes (1-based);
D5.end - positions of the D 5' end of aligned D genes (1-based);
D3.end - positions of the D 3' end of aligned D genes (1-based);
VD.insertions - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
DJ.insertions - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
Total.insertions - total number of inserted nucleotides (number of N-nucleotides at V-J junction for receptors with VJ recombination).
Any data frame with these columns and of this class is suitable for processing with the package, hence users can generate their own table files and load them for further analysis using read.csv, read.table and other base R functions. Please note that tcR internally expects all strings to be of class “character”, not “factor”. Therefore you should use R parsing functions with the parameter stringsAsFactors=FALSE.
# No D genes are available here, hence "" in "D.gene" and "-1" at positions.
str(twa[[1]])
## 'data.frame': 10000 obs. of 16 variables:
## $ Umi.count : logi NA NA NA NA NA NA ...
## $ Umi.proportion : logi NA NA NA NA NA NA ...
## $ Read.count : int 147158 58018 55223 41341 31525 30682 20913 16695 14757 13745 ...
## $ Read.proportion : num 0.0857 0.0338 0.0322 0.0241 0.0184 ...
## $ CDR3.nucleotide.sequence: chr "TGTGCTGTGATGGATAGCAACTATCAGTTAATCTGG" "TGTGCAGAGAAGTCTAGCAACACAGGCAAACTAATCTTT" "TGTGTGGTGAACTTTACTGGAGGCTTCAAAACTATCTTT" "TGTGCAGCAAGTATAGGGCTTGTATCTAACTTTGGAAATGAGAAATTAACCTTT" ...
## $ CDR3.amino.acid.sequence: chr "CAVMDSNYQLIW" "CAEKSSNTGKLIF" "CVVNFTGGFKTIF" "CAASIGLVSNFGNEKLTF" ...
## $ V.gene : chr "TRAV1-2" "TRAV5" "TRAV12-1" "TRAV13-1" ...
## $ J.gene : chr "TRAJ33" "TRAJ37" "TRAJ9" "TRAJ48" ...
## $ D.gene : chr "" "" "" "" ...
## $ V.end : int 9 9 11 12 9 8 4 12 9 12 ...
## $ J.start : int 10 12 14 22 19 9 15 13 15 14 ...
## $ D5.end : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ D3.end : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ VD.insertions : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ DJ.insertions : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ Total.insertions : int 0 2 2 9 9 0 10 0 5 1 ...
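As a small illustration of the stringsAsFactors advice, the sketch below reads an invented two-row table from a string with base R. The values are made up, and the table holds only a few of the columns listed earlier, so it shows the loading step rather than a complete cloneset.

```r
# Invented two-row table; column names follow the cloneset layout above.
txt <- "Read.count,CDR3.amino.acid.sequence,V.gene,J.gene
120,CASSLGTDTQYF,TRBV7-9,TRBJ2-3
45,CASSPGQGNEQFF,TRBV5-1,TRBJ2-1"

# stringsAsFactors = FALSE keeps text columns as class "character".
clones <- read.csv(text = txt, stringsAsFactors = FALSE)
str(clones)
```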
For the exploratory analysis tcR provides various functions for computing descriptive statistics.
Cloneset summary
To get a general view of a subject’s repertoire (overall count of sequences, in- and out-of-frame numbers and proportions), use the cloneset.stats function. It returns a summary of counts of nucleotide and amino acid clonotypes, as well as a summary of read counts:
Function clonal.proportion returns the number of the most abundant clonotypes (by read count) needed to fill a given share of the repertoire. E.g., compute the number of clonotypes which fill up (approx.) 25% of the repertoire’s total “Read.count”:
# How many clonotypes fill up approximately
clonal.proportion(twb, 25) # the 25% of the sum of values in 'Read.count'?
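The computation can be sketched on a plain vector of read counts. The helper clonal_proportion_sketch is hypothetical and returns only the number of clonotypes; the value returned by clonal.proportion itself may carry additional information.

```r
# Hypothetical helper: how many of the most abundant clonotypes are
# needed to reach `perc` percent of the total read count.
clonal_proportion_sketch <- function(counts, perc) {
  counts <- sort(counts, decreasing = TRUE)
  cum_perc <- cumsum(counts) / sum(counts) * 100
  which(cum_perc >= perc)[1]
}

reads <- c(50, 30, 10, 5, 5)
clonal_proportion_sketch(reads, 25)
clonal_proportion_sketch(reads, 60)
```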
Function tailbound.proportion, with two arguments .col and .bound, takes the subset of the given data frame with clonotypes whose value in column .col is \(\leq\) .bound and computes the ratio of the sum of read counts in that subset to the sum over the whole data frame. E.g., get the proportion of the sum of reads of sequences which have “Read.count” <= 100 to the overall number of reads:
# What is a proportion of sequences which
# have 'Read.count' <= 100 to the
tailbound.proportion(twb, 100) # overall number of reads?
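The same ratio can be sketched in base R on a plain vector of counts. The helper `tail_read_share` and the toy data are hypothetical and only mirror the description above, not the package internals.

```r
# Hypothetical helper (not part of tcR): share of all reads carried by
# clonotypes whose count is at most `bound`.
tail_read_share <- function(read_count, bound) {
  sum(read_count[read_count <= bound]) / sum(read_count)
}

reads <- c(500, 300, 100, 60, 30, 10)  # toy data, sums to 1000
tail_read_share(reads, 100)            # (100 + 60 + 30 + 10) / 1000 = 0.2
```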
Functions for subsetting and counting in-frame and out-of-frame clonotypes are count.inframes, count.outframes, get.inframes and get.outframes. Parameter .head for these functions is passed to the head function, which is applied to the input data frame or input list of data frames before subsetting. The functions accept both data frames and lists of data frames. E.g., get a data frame with only in-frame sequences, and count out-of-frame sequences among the first 5000 rows of each data frame:
imm.in <- get.inframes(twb) # Return all in-frame sequences from the 'twb'.
# Count the number of out-of-frame sequences
count.outframes(twb, 5000) # from the first 5000 sequences.
## Subj.A Subj.B Subj.C Subj.D
## 172 212 73 326
General functions that take a parameter standing for ‘all’ (all sequences), ‘in’ (only in-frame sequences) or ‘out’ (only out-of-frame sequences) are get.frames and count.frames:
imm.in <- get.frames(twb, 'in') # Similar to 'get.inframes(twb)'.
count.frames(twb[[1]], 'all') # Just return number of rows.
## [1] 10000
flag <- 'out'
count.frames(twb, flag, 5000) # Similar to 'count.outframes(twb, 5000)'.
## Subj.A Subj.B Subj.C Subj.D
## 172 212 73 326
Search for target CDR3 sequences
For exact or fuzzy search of sequences the package provides the find.clonotypes function. Its input arguments are a data frame or a list of data frames; targets (a character vector, or a data frame with one column of sequences and additional columns with, e.g., V genes); the column or columns to return; the method used to compare sequences (either “exact” for exact matching, “hamm” for matching by Hamming distance (two sequences match if H \(\leq\) 1) or “lev” for matching by Levenshtein distance (two sequences match if L \(\leq\) 1)); and the name of the column from which sequences for matching are taken. This sounds complex, but in practice it is easy, so let’s go to examples.
Suppose we want to search for some CDR3 sequences in a number of repertoires:
We will search for them using all methods of matching (exact, Hamming or Levenshtein), both with and without matching by V-segment. For the first case (exact matching without V gene) we return the “Total.insertions” column along with the “Read.count” column, and for the second case the output is “Rank”, the rank (generated by set.rank) of a clone or a clonotype in a data frame.
# Case 2.
# Search for CDR3 sequences with hamming distance <= 1
# to the one of the cmv$CDR3.amino.acid.sequence with
# matching V genes. Return ranks of found sequences.
cmv.imm.hamm.v <-
find.clonotypes(twb[1:3], cmv, 'hamm', 'Rank',
.target.col = c('CDR3.amino.acid.sequence',
'V.gene'),
.verbose = F)
head(cmv.imm.hamm.v)
## CDR3.amino.acid.sequence V.gene Rank.Subj.A Rank.Subj.B
## CASSALGGAGTGELFF CASSALGGAGTGELFF TRBV4-1 NA NA
## CASSLIGVSSYNEQFF CASSLIGVSSYNEQFF TRBV4-1 NA NA
## CASSLTGNTEAFF CASSLTGNTEAFF TRBV4-1 NA NA
## CASSSANYGYTF CASSSANYGYTF TRBV4-1 NA NA
## CSVGRAQNEQFF CSVGRAQNEQFF TRBV4-1 NA NA
## Rank.Subj.C
## CASSALGGAGTGELFF NA
## CASSLIGVSSYNEQFF NA
## CASSLTGNTEAFF NA
## CASSSANYGYTF NA
## CSVGRAQNEQFF NA
# Case 3.
# Similar to the previous example, except
# using levenshtein distance and the "Read.count" column.
cmv.imm.lev.v <-
find.clonotypes(twb[1:3], cmv, 'lev',
.target.col = c('CDR3.amino.acid.sequence', 'V.gene'),
.verbose = F)
head(cmv.imm.lev.v)
## CDR3.amino.acid.sequence V.gene Read.count.Subj.A
## CASSALGGAGTGELFF CASSALGGAGTGELFF TRBV4-1 NA
## CASSLIGVSSYNEQFF CASSLIGVSSYNEQFF TRBV4-1 NA
## CASSLTGNTEAFF CASSLTGNTEAFF TRBV4-1 NA
## CASSSANYGYTF CASSSANYGYTF TRBV4-1 NA
## CSVGRAQNEQFF CSVGRAQNEQFF TRBV4-1 NA
## Read.count.Subj.B Read.count.Subj.C
## CASSALGGAGTGELFF NA NA
## CASSLIGVSSYNEQFF NA NA
## CASSLTGNTEAFF NA NA
## CASSSANYGYTF NA NA
## CSVGRAQNEQFF NA NA
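To make the three matching rules concrete, here is a small base-R sketch that compares one target against a pool of sequences. It does not call find.clonotypes; the helper `hamming` and the toy pool are hypothetical, and `adist` from the utils package is used for the Levenshtein (edit) distance.

```r
# Hamming distance between two strings; it is defined only for equal
# lengths, so strings of different length are treated as infinitely far.
hamming <- function(a, b) {
  if (nchar(a) != nchar(b)) return(Inf)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

target <- "CASSLTGNTEAFF"
pool <- c("CASSLTGNTEAFF",   # identical
          "CASSLSGNTEAFF",   # one substitution
          "CASSLTGNTEAF",    # one deletion
          "CASSSANYGYTF")    # unrelated

exact_hit <- pool == target
hamm_hit  <- sapply(pool, hamming, b = target, USE.NAMES = FALSE) <= 1
lev_hit   <- as.vector(adist(pool, target)) <= 1

data.frame(pool, exact_hit, hamm_hit, lev_hit)
```

The deletion variant is found only by the Levenshtein rule, because Hamming distance cannot compare sequences of different lengths.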
Gene usage
Variable and Joining gene usage (V-usage and J-usage) are important characteristics of repertoires. To access and compare them among repertoires tcR provides a few useful functions.
Gene usage computing
To access V- and J-usage of a repertoire tcR provides the geneUsage function. Depending on its parameters, geneUsage computes frequencies or counts of the given elements (e.g., V genes) of the input data frame or the input list of data frames. V and J gene names for human TCR and Ig are stored in the .rda file genesegments.rda (they are identical to those from IMGT). The function accepts data frames as well as lists of data frames. Its output is a data frame in which the first column holds gene names and the other columns hold frequencies.
You can also directly visualise gene usage with the function vis.gene.usage (if you pass the gene alphabet as a second argument):
# Put ".dodge = F" to get distinct plot for every data frame in the given list.
vis.gene.usage(twb, HUMAN_TRBJ, .main = 'twb J-usage dodge', .dodge = T)
To evaluate V- and J-gene usage of repertoires, the package implements subroutines for two approaches to the analysis: measures from information theory and PCA (Principal Component Analysis).
Shannon entropy and Jensen-Shannon divergence
To assess the diversity of gene usage, use the entropy function. The asymmetric Kullback-Leibler measure (function kl.div) and the symmetric Jensen-Shannon measure (functions js.div for computing JS-divergence between the given distributions and js.div.seg for computing JS-divergence between gene distributions of two clonesets or a list of data frames) are provided to estimate the distance between gene usage of different repertoires. To visualise distances tcR provides the vis.radarlike function; see Section “Plots” for more detailed information.
# Transform "0:100" to distribution with Laplace correction
entropy(0:100, .laplace = 1) # (i.e., add "1" to every value before transformation).
## [1] 6.399261
entropy.seg(twb, HUMAN_TRBV) # Compute entropy of V-segment usage for each data frame.
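For readers who want to see the measures spelled out, the following base-R sketch implements the textbook definitions of Shannon entropy, Kullback-Leibler divergence and Jensen-Shannon divergence with an optional Laplace pseudocount. All function names are hypothetical, and the logarithm base and scaling are assumptions; the tcR functions may normalise differently, so the numbers need not agree with package output.

```r
# Turn non-negative counts into a probability distribution,
# optionally adding a Laplace pseudocount to every value first.
to_distribution <- function(x, laplace = 0) {
  x <- x + laplace
  x / sum(x)
}

# Shannon entropy in bits; zero-probability terms contribute nothing.
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Kullback-Leibler divergence (asymmetric); assumes q > 0 wherever p > 0.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon divergence (symmetric): mean KL divergence to the midpoint.
js_divergence <- function(p, q) {
  m <- (p + q) / 2
  0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
}

usage_a <- to_distribution(c(10, 0, 5, 5), laplace = 1)  # toy gene counts
usage_b <- to_distribution(c(2, 8, 5, 5), laplace = 1)
shannon_entropy(usage_a)
js_divergence(usage_a, usage_b)
```

The pseudocount keeps every probability positive, which is what makes the Kullback-Leibler term well defined when a gene is absent from one repertoire.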
Principal component analysis (PCA) is a statistical procedure that transforms a set of observations into a set of linearly uncorrelated variables (principal components). tcR implements the functions pca.segments for performing PCA on V- or J-usage, and pca.segments.2D for performing PCA on VJ-usage. For plotting the PCA results see the vis.pca function.
## Warning: In prcomp.default(t(as.matrix(.data)), ...) :
## extra argument '.genes' will be disregarded
# Return object of class "prcomp"
class(pca.segments(twb, .do.plot = F, .genes = HUMAN_TRBV))
## Warning: In prcomp.default(t(as.matrix(.data)), ...) :
## extra argument '.genes' will be disregarded
## [1] "prcomp"
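The underlying computation can be sketched with base R’s prcomp on a made-up gene-usage matrix. The gene names, sample names and random frequencies below are purely illustrative; the transpose mirrors the `t(as.matrix(.data))` call visible in the warning above, so that samples become rows.

```r
# Toy gene-usage matrix: rows are genes, columns are samples.
set.seed(1)
usage <- matrix(runif(5 * 4), nrow = 5,
                dimnames = list(paste0("Gene", 1:5),
                                paste0("Subj.", LETTERS[1:4])))
usage <- sweep(usage, 2, colSums(usage), "/")  # each column sums to 1

# prcomp expects observations (samples) in rows, hence the transpose.
pca <- prcomp(t(usage))
class(pca)
pca$x[, 1:2]   # sample coordinates on the first two components
```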
Repertoire overlap analysis
tcR provides a number of functions for evaluating the similarity of clonesets based on the clonotypes they share, and for working with data frames of shared clonotypes.
Overlap quantification
The general interface to all functions for computing cloneset overlap coefficients is the repOverlap function.
Number of shared clonotypes
The most straightforward yet quite effective way to evaluate the similarity of two clonesets is to compute the number of shared clonotypes. tcR adds the new function intersectClonesets (repOverlap(your_data, 'exact')), which by default computes the number of shared clonotypes using the “CDR3.nucleotide.sequence” columns of the given data frames; the user can change the target columns with the arguments .type or .col. As in find.clonotypes, the user can choose which matching method to apply to the elements: exact match, match by Hamming distance or match by Levenshtein distance. The logical argument .norm normalises the number of shared clonotypes by dividing it by the product of the clonesets’ sizes (strongly recommended, otherwise your results will correlate with the clonesets’ sizes).
## Subj.A Subj.B Subj.C Subj.D
## Subj.A NA 4.6e-07 4.4e-07 4.0e-07
## Subj.B 4.6e-07 NA 2.7e-07 2.7e-07
## Subj.C 4.4e-07 2.7e-07 NA 6.2e-07
## Subj.D 4.0e-07 2.7e-07 6.2e-07 NA
# Intersect by amino acid clonotypes + V genes
repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F)
## Subj.A Subj.B Subj.C Subj.D
## Subj.A NA 1.58e-06 6.50e-07 5.80e-07
## Subj.B 1.58e-06 NA 5.60e-07 4.70e-07
## Subj.C 6.50e-07 5.60e-07 NA 1.31e-06
## Subj.D 5.80e-07 4.70e-07 1.31e-06 NA
# Plot a heatmap of the number of shared clonotypes.
vis.heatmap(repOverlap(twb, 'exact', 'aa', .vgene = T, .verbose = F), .title = 'twb - (ave)-intersection', .labs = '')
See the vis.heatmap function in the Section “Visualisation” for the visualisation of the intersection results.
Functions intersectCount, intersectLogic and intersectIndices are more flexible in terms of choosing which columns to match. They all have the parameter .col, which specifies the names of the columns used in computing the intersection. Function intersectCount returns the number of similar elements; intersectIndices(x, y) returns a 2-column matrix in which the first column holds the index of an element in x and the second column holds the index of the matching element in y; intersectLogic(x, y) returns a logical vector of length(x) or nrow(x), where TRUE at position i means that the element with index i has been found in y.
# Get logical vector of shared elements, where
# elements are tuples of CDR3 amino acid sequence and corresponding V-segment
imm.1.2 <- intersectLogic(twb[[1]], twb[[2]],
.col = c('CDR3.amino.acid.sequence', 'V.gene'))
# Get elements which are in both twb[[1]] and twb[[2]].
head(twb[[1]][imm.1.2, c('CDR3.amino.acid.sequence', 'V.gene')])
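The three return shapes described above (count, logical vector, index matrix) can be imitated in base R by building one composite key per row. This is only a sketch of exact matching on toy data; the variable names are hypothetical, and `match` reports just the first matching row of y, which may be simpler than what the package functions do.

```r
x <- data.frame(
  CDR3.amino.acid.sequence = c("CASSLTGNTEAFF", "CASSSANYGYTF", "CSVGRAQNEQFF"),
  V.gene = c("TRBV4-1", "TRBV4-1", "TRBV7-2"),
  stringsAsFactors = FALSE)
y <- data.frame(
  CDR3.amino.acid.sequence = c("CSVGRAQNEQFF", "CASSLTGNTEAFF", "CASSALGGAGTGELFF"),
  V.gene = c("TRBV7-2", "TRBV5-1", "TRBV4-1"),
  stringsAsFactors = FALSE)
cols <- c("CDR3.amino.acid.sequence", "V.gene")

# One key per row: the selected columns pasted together.
key_x <- do.call(paste, c(x[cols], sep = "|"))
key_y <- do.call(paste, c(y[cols], sep = "|"))

shared_logic <- key_x %in% key_y                 # like a logical vector over x
shared_count <- sum(shared_logic)                # like a count of shared rows
shared_index <- cbind(x = which(shared_logic),   # like an index matrix
                      y = match(key_x[shared_logic], key_y))
shared_index
```

Only the third row of x is shared: the first row has the same CDR3 sequence as a row in y but a different V gene, so matching on both columns rejects it.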
The number of shared clonotypes among the most abundant clonotypes may differ significantly from that among clonotypes with lower counts. To support such analysis tcR offers the top.cross function, which applies tcR::intersectClonesets, e.g., to the first 1000 clonotypes, 2000, 3000 and so on up to the first 100000 clones, if supplied .n == seq(1000, 100000, 1000).
tcR also provides more complex similarity measures for evaluating the similarity of sets.
Tversky index (repOverlap(your_data, 'tversky') for clonesets or tversky.index for vectors) is an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it’s similar to Dice’s coefficient.
Overlap coefficient (repOverlap(your_data, 'overlap') for clonesets or overlap.coef for vectors) is a similarity measure that measures the overlap between two sets, and is defined as the size of the intersection divided by the smaller of the size of the two sets.
Jaccard index (repOverlap(your_data, 'jaccard') for clonesets or jaccard.index for vectors) is a statistic used for comparing the similarity and diversity of sample sets.
Morisita’s overlap index (repOverlap(your_data, 'morisita') for clonesets or morisitas.index for other data) is a statistical measure of dispersion of individuals in a population and is used to compare overlap among samples. The formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas) (Morisita, 1959).
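The set-based measures listed above have short textbook definitions, sketched here in base R on plain character vectors. The function and argument names (`jaccard`, `overlap_coef`, `tversky`, `alpha`, `beta`) are hypothetical and are not the tcR signatures; the weights of 0.5 are chosen so that the Tversky index reduces to Dice’s coefficient.

```r
jaccard <- function(a, b) {
  a <- unique(a); b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

overlap_coef <- function(a, b) {
  a <- unique(a); b <- unique(b)
  length(intersect(a, b)) / min(length(a), length(b))
}

# Tversky index; alpha weights elements only in `a`, beta those only in `b`.
tversky <- function(a, b, alpha = 0.5, beta = 0.5) {
  a <- unique(a); b <- unique(b)
  shared <- length(intersect(a, b))
  shared / (shared + alpha * length(setdiff(a, b)) + beta * length(setdiff(b, a)))
}

a <- c("CASSLTGNTEAFF", "CASSSANYGYTF", "CSVGRAQNEQFF", "CASSALGGAGTGELFF")
b <- c("CASSLTGNTEAFF", "CSVGRAQNEQFF", "CASSLIGVSSYNEQFF")
jaccard(a, b)       # 2 shared / 5 distinct = 0.4
overlap_coef(a, b)  # 2 shared / min(4, 3)  = 2/3
tversky(a, b)       # 2 / (2 + 0.5 * 2 + 0.5 * 1) = 4/7
```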
To visualise similarity among repertoires the vis.heatmap function is appropriate.
# Apply Morisita's overlap index to each pair of repertoires.
# Use information about V genes (i.e. one CDR3 clonotype is equal to another
# if and only if their CDR3 aa sequences are equal and their V genes are equal)
repOverlap(twb, 'morisita', 'aa', 'read.count', .vgene = T, .verbose = F)
## Subj.A Subj.B Subj.C Subj.D
## Subj.A NA 2.353362e-03 2.125406e-05 0.0002231617
## Subj.B 2.353362e-03 NA 9.765356e-06 0.0002413103
## Subj.C 2.125406e-05 9.765356e-06 NA 0.0004875433
## Subj.D 2.231617e-04 2.413103e-04 4.875433e-04 NA
Overlap statistics and tests
Overlap Z-score (OZ-score) - a measure for “abnormality” in overlaps
ozScore
Monte Carlo permutation test for pairwise and one-vs-all-wise within- and inter-group differences in a set of repertoires
permutDistTest
pca2euclid
Shared repertoire
To investigate the clonotypes shared among several repertoires (“shared repertoire”) the package provides the shared.repertoire function along with functions for computing shared repertoire statistics. The shared.representation function computes the number of shared clonotypes for each repertoire for each degree of sharing (i.e., the number of people in which the indicated number of clones has been found). The function shared.summary is equivalent to repOverlap(, 'exact') but applies to the shared repertoire data frame. Measuring distances among repertoires using the cosine similarity on vectors of counts of shared sequences is also possible with the cosine.sharing function.
# Compute shared repertoire of amino acid CDR3 sequences and V genes
# which has been found in two or more people and return the Read.count column
# of such clonotypes from each data frame in the input list.
imm.shared <- shared.repertoire(.data = twb, .type = 'avrc', .min.ppl = 2, .verbose = F)
head(imm.shared)
For assessing the distribution of clonotypes in the given repertoire, tcR provides functions for evaluating the diversity (functions diversity and inverse.simpson) and the skewness of the clonal distribution (functions gini and gini.simpson), and a general interface to all of these functions, repDiversity, which the user should use to estimate the diversity of clonesets. Function diversity (repDiversity(your_clonesets, "div")) computes the ecological diversity index (with parameter .q for penalising clones with large counts). Function inverse.simpson (repDiversity(your_clonesets, "inv.simp")) computes the Inverse Simpson Index (i.e., the inverse probability of choosing two similar clonotypes). Function gini (repDiversity(your_clonesets, "gini")) computes the economical Gini index of the clonal distribution. Function gini.simpson (repDiversity(your_clonesets, "gini.simp")) computes the Gini-Simpson index. Function chao1 (repDiversity(your_clonesets, "chao1")) computes the Chao1 index, its SD and the two bounds of the 95% CI. Function repDiversity accepts single clonesets as well as a list of clonesets. Parameter .quant specifies which column to use for computing the diversity (run ?repDiversity to see more information about input arguments).
# Evaluate the diversity of clones by the ecological diversity index.
repDiversity(twb, 'div', 'read.count')
sapply(twb, function (x) diversity(x$Read.count))
# Compute the diversity as the inverse probability of choosing two similar clonotypes.
repDiversity(twb, 'inv.simp', 'read.prop')
sapply(twb, function (x) inverse.simpson(x$Read.proportion))
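For orientation, the indices named above have compact textbook forms, sketched here in base R on a vector of clone counts. The function names are hypothetical, the Chao1 line is the classic point estimate only (no SD or confidence interval), and the package may use different variants or corrections, so treat this as an assumption-laden illustration rather than a description of tcR internals.

```r
# Inverse Simpson index: 1 / probability that two random reads share a clone.
inverse_simpson <- function(counts) {
  p <- counts / sum(counts)
  1 / sum(p^2)
}

# Gini-Simpson index: probability that two random reads differ in clone.
gini_simpson <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Gini coefficient of inequality: 0 when all clones are equally abundant.
gini_coef <- function(counts) {
  x <- sort(counts)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Classic Chao1 point estimate; assumes at least one doubleton (f2 > 0).
chao1_estimate <- function(counts) {
  f1 <- sum(counts == 1)   # singletons
  f2 <- sum(counts == 2)   # doubletons
  sum(counts > 0) + f1^2 / (2 * f2)
}

even   <- c(25, 25, 25, 25)   # toy repertoires
skewed <- c(97, 1, 1, 1)
inverse_simpson(even)     # 4: four equally abundant clones
inverse_simpson(skewed)   # close to 1: one clone dominates
gini_coef(even)           # 0
```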
The distribution of CDR3 nucleotide sequence lengths is plotted with the vis.count.len function, and the histogram of counts with the vis.number.count function. Input data is either a data frame or a list of data frames. Argument .col specifies the name of the column with clonotype counts. Argument .ncol specifies the number of columns in a plot with multiple distributions, i.e., when the input data is a list of data frames.
# I comment this to avoid a strange bug in ggplot2. Will uncomment later.
# vis.number.count(twb[[1]], .name = "twb[[1]] count distribution")
Top proportions bar plot
For the visualisation of proportions of the most abundant clonotypes in a repertoire tcR offers the vis.top.proportions function. As input the function receives either a data frame or a list of data frames (argument .data), an integer vector with the numbers of top clonotypes for which to compute proportions of counts (argument .head), and the name of the column with clonotype counts (argument .col).
For the visualisation of how much space each group of clonotypes occupies, with clonotypes divided into groups by their proportions in the data, use the vis.clonal.space function. As input it receives the output of the clonal.space.homeostasis function.
Pairwise distances or similarities of repertoires can be represented as square matrices, in which each row and column represents a cloneset, and the value in cell (i, j) is the distance between the repertoires with indices i and j. One way to visualise such matrices is using “heatmaps”. For plotting heatmaps tcR implements the vis.heatmap function. By changing input arguments the user can change the axis labels, title and legend.
Another way to represent distances among objects is “radar-like” plots (so called because these plots are not exactly radar plots), implemented in tcR through the vis.radarlike function. Argument .ncol specifies the number of columns of radar-like plots in a viewport.
For the visualisation of gene usage tcR provides subroutines for making classical histograms via the vis.gene.usage function. The function accepts clonesets, lists of clonesets or output from the geneUsage function. If the input is one or more clonesets, the user should specify a gene alphabet (e.g., HUMAN_TRBV) in order to compute the gene usage. Using the parameter .dodge, the user can switch between a separate histogram for each cloneset in the input list (.dodge = F) and one histogram for all data, which is very useful for comparing distributions of genes (.dodge = T). If .dodge = F and the input is a list of clonesets or the gene usage of a few clonesets, then the argument .ncol specifies how many columns of histograms will be output. With .coord.flip the user can flip coordinates so that genes are at the left side of the plot.
vis.gene.usage(twb[[1]], HUMAN_TRBV, .main = 'Sample I V-usage')
vis.gene.usage(twb[[2]], HUMAN_TRBV, .main = 'Sample II V-usage', .coord.flip = T)
For the visualisation of results from the prcomp function (i.e., objects of class prcomp), tcR provides the vis.pca function. Input arguments for the function are an object of class prcomp and, if needed, a list of groups (vectors of indices of samples) for colouring points in the plot.
## Warning: In prcomp.default(t(as.matrix(.data)), ...) :
## extra argument '.genes' will be disregarded
Logo-like plot
Logo-like graphs for visualisation of nucleotide or amino acid motif sequences / profiles.
km <- get.kmers(twb[[1]]$CDR3.amino.acid.sequence, .head = 100, .k = 7, .verbose = F)
d <- kmer.profile(km)
vis.logo(d)
Mutation networks
A mutation network (or mutation graph) is a graph whose vertices represent nucleotide or in-frame amino acid sequences (out-of-frame amino acid sequences are automatically filtered out by the tcR functions for creating mutation networks) and whose edges connect pairs of sequences with a Hamming distance (parameter .method = ‘hamm’) or edit distance (parameter .method = ‘lev’) between them of no more than the value of the .max.errors parameter of the mutation.network function. To create a mutation network, first build a shared repertoire and then apply the mutation.network function to it:
# data(twb)
twb.shared <- shared.repertoire(twb, .head = 1000, .verbose = F)
G <- mutation.network(twb.shared)
G
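The edge rule can be illustrated without the package: the base-R sketch below connects every pair of toy sequences whose Hamming distance does not exceed a threshold and returns the result as an edge list of vertex indices. The helper `hamming` and the variable `max_errors` are hypothetical stand-ins; this does not reproduce the object returned by mutation.network.

```r
seqs <- c("CASSLTGNTEAFF", "CASSLSGNTEAFF", "CASSLSGNTEAFY", "CASSSANYGYTF")

# Hamming distance for equal-length strings, Inf otherwise.
hamming <- function(a, b) {
  if (nchar(a) != nchar(b)) return(Inf)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

max_errors <- 1
pairs <- t(combn(length(seqs), 2))   # all unordered pairs of vertex indices
d <- apply(pairs, 1, function(ij) hamming(seqs[ij[1]], seqs[ij[2]]))

# Keep the pairs that are close enough; each row is one edge.
edges <- pairs[d <= max_errors, , drop = FALSE]
edges
```

Sequences 1 and 3 differ at two positions, so they are linked only indirectly through sequence 2, and the shorter fourth sequence stays isolated.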
Feel free to contact me with package-related or immunoinformatics research-related questions.
If you spot a bug or would like to see something useful for you in the package feel free to raise an issue at tcR GitHub: https://github.com/imminfo/tcr/issues
Appendix A: Kmers manipulation and processing
The package implements functions for working with k-mers. Function get.kmers generates k-mers from the given character vector or from a data frame with columns for sequences and a count for each sequence.
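As a minimal sketch of what generating k-mers means, the base-R code below extracts all overlapping substrings of length k from two toy sequences, counts them, and builds a per-position letter frequency profile of the kind a logo plot displays. The helper `kmers_of` is hypothetical and the output layout is an assumption, not the format returned by get.kmers or kmer.profile.

```r
# All overlapping substrings of length k from one sequence.
kmers_of <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))
  substring(s, 1:(n - k + 1), k:n)
}

seqs <- c("CASSLTGNTEAFF", "CASSSANYGYTF")
km <- unlist(lapply(seqs, kmers_of, k = 7))
kmer_counts <- table(km)     # how often each k-mer occurs

# Position profile: rows are letters, columns are positions 1..k,
# values are the fraction of k-mers with that letter at that position.
chars <- do.call(rbind, strsplit(km, ""))
alphabet <- sort(unique(as.vector(chars)))
profile <- apply(chars, 2, function(col) {
  table(factor(col, levels = alphabet)) / length(col)
})
round(profile, 2)
```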
Appendix B: Nucleotide and amino acid sequences manipulation
The package also provides a number of functions for performing classic bioinformatics tasks on strings. For more powerful subroutines see Bioconductor’s Biostrings package.
Nucleotide sequence manipulation
Functions for basic nucleotide sequences manipulations: reverse-complement, translation and GC-content computation. All functions are vectorised.
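The three operations can be sketched in vectorised base R as follows. The function names `reverse_complement`, `gc_content` and `translate` are hypothetical and are not the tcR functions; the codon table is the standard genetic code with stop codons written as `*`, and the example input is the first CDR3 nucleotide sequence from the str() output shown earlier.

```r
reverse_complement <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)
  sapply(strsplit(comp, ""), function(ch) paste(rev(ch), collapse = ""))
}

gc_content <- function(s) {
  sapply(strsplit(s, ""), function(ch) mean(ch %in% c("G", "C")))
}

# Standard genetic code, codons ordered T, C, A, G with the third base fastest.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codon_table <- setNames(
  strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]],
  paste0(grid$b1, grid$b2, grid$b3))

# Translate complete codons from the first position; trailing bases are dropped.
translate <- function(s) {
  sapply(s, function(one) {
    n <- nchar(one) %/% 3
    starts <- seq(1, by = 3, length.out = n)
    paste(codon_table[substring(one, starts, starts + 2)], collapse = "")
  }, USE.NAMES = FALSE)
}

nuc <- "TGTGCTGTGATGGATAGCAACTATCAGTTAATCTGG"
translate(nuc)               # "CAVMDSNYQLIW"
reverse_complement("ATGC")   # "GCAT"
gc_content("ATGC")           # 0.5
```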
Function codon.variants returns a list of vectors of nucleotide codons for each letter of each input amino acid sequence. Function translated.nucl.sequences returns the number of nucleotide sequences which, when translated, result in the given amino acid sequence(s). Function reverse.translation returns all nucleotide sequences which translate to the given amino acid sequences. The optional argument .nucseq for each of these functions restricts which nucleotides cannot be changed. All functions are vectorised.
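The counting idea follows directly from codon degeneracy: the number of nucleotide sequences that encode an amino acid sequence is the product of the number of codons available for each residue. The base-R sketch below uses the standard genetic code; the names `codon_variants_of` and `n_coding_sequences` are hypothetical, and the .nucseq restriction described above is not modelled.

```r
# Standard genetic code, codons ordered T, C, A, G with the third base fastest.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codons <- paste0(grid$b1, grid$b2, grid$b3)
amino <- strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]

codons_by_aa <- split(codons, amino)     # amino acid -> its codons
codons_per_aa <- lengths(codons_by_aa)   # amino acid -> number of codons

# Codons available for each residue of one amino acid sequence.
codon_variants_of <- function(aa_seq) {
  unname(codons_by_aa[strsplit(aa_seq, "")[[1]]])
}

# Number of distinct nucleotide sequences encoding each amino acid sequence.
n_coding_sequences <- function(aa_seq) {
  sapply(strsplit(aa_seq, ""), function(ch) prod(as.numeric(codons_per_aa[ch])))
}

codon_variants_of("MW")               # list("ATG", "TGG")
n_coding_sequences(c("MW", "CAS"))    # 1 and 2 * 4 * 6 = 48
```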