qlcMatrix/ 0000755 0001762 0000144 00000000000 13266310250 012214 5 ustar ligges users qlcMatrix/inst/ 0000755 0001762 0000144 00000000000 13266303707 013202 5 ustar ligges users qlcMatrix/inst/doc/ 0000755 0001762 0000144 00000000000 13266303707 013747 5 ustar ligges users qlcMatrix/inst/doc/transformingDataSparse.Rmd 0000644 0001762 0000144 00000010243 13266125510 021066 0 ustar ligges users --- title: 'Transforming data into sparse matrices' author: "Michael Cysouw" date: '`r Sys.Date()`' output: rmarkdown::html_vignette: md_document: variant: markdown_github pdf_document: number_sections: yes latex_engine: xelatex vignette: > %\VignetteIndexEntry{Transforming data into sparse matrices} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} library(qlcMatrix) ``` # Transforming data into sparse matrices The basic functions of this package are almost trivial, but they allow for a highly flexible and efficient transformation of data into sparse matrices. The tree basic functions are `ttMatrix`, `pwMatrix`, and `jMatrix`. This vignette will present a gentle instroduction to these three functions. ### ttMatrix The principle of `ttMatrix` ("type-token Matrix") is that a sparse matrix is created from a vector. All unique elements in the vector are listed as rows ("types"), and the elements of the vector itself are listed as columns ("tokens"). The matrix shows which types are found in which position of the original vector. Take a simple character vector with repeating elements as an example, then `ttMatrix` will return the row names (`$rownames`) by default separately from the matrix (`$M`). The column names are of course identical to the input in `data`. ```{r ttIntro} data <- c("a", "b", "a", "c", "b", "c", "A") ttMatrix(data) ``` You can also include the row and column names into the matrix as output by using the option `simplify = TRUE`. Note that in many cases of large sparse matrices, the same row or column names will be repeated in different sparse matrices. In such situations it is preferable to store the row or column names separately. Simply duplicating them migth take a lot of space. Also note that the ordering of the rows (the "types") is crucial. This ordering is determined by the so-called `collation.locale` option, which defaults to "C"" (which means the numerical ordering of the Unicode Standard is used). This ordering is different from the default ordering on most systems, which might internally for example use the collation locale `en_US.UTF-8'. The difference can be discerned in the example below, as the capital letters are ordered after the lowercase letters in the US-english locale, but before in the "C" locale (compare with the output of the previous call). ```{r ttSimplify} ttMatrix(data, simplify = TRUE, collation.locale = "en_US.UTF-8") ``` ### pwMatrix The principle of `pwMatrix` ("part-whole Matrix") is so simple that it seems almost trivial. However, in combination with a bit matrix algebra the utility of this simple sparse Matrix is very impressive. First consider a simple vector of strings. Basically, `pwMatrix` looks at all parts of the strings (split into parts as determined by the separator `sep`). In the example below, The row and column names are added to the matrix by `simplify = TRUE`, and a gap has been added between the strings (like treating them as a sentence with boundaries). ```{r pwIntro} strings <- c("this", "is", "that") (PW <- pwMatrix(strings, sep = "", simplify = TRUE, gap.length = 1, gap.symbol = "_") ) ``` By itself, this part-whole Matrix is not very interesting. It becomes more interesting when we also consider the type-token Matrix of the "parts" (i.e. rownames) of the part-whole Matrix. The matrix product of these two matrices gives a sparse way to count how often each "part" occurs in each "whole". This is a quick and sparse way to count parts. ```{r pwUsage} TT <- ttMatrix(rownames(PW), simplify = TRUE) printSpMatrix(TT, col.names = TRUE) (TT*1) %*% (PW*1) ``` Even more amazing is the following trick. Using a "upper-shift" matrix (i.e. a matrix with 1s on the diagonal one above the main diagonal), then the following simple matrix multiplication lists the number of adjacent combinations (i.e. how often is one letter followed by another in the strings? First element in the rows, second in the columns): ```{r adjacency} S <- bandSparse( n = ncol(TT), k = 1) (TT*1) %*% (S*1) %*% t(TT*1) ``` This kind of computations start making a lot of sense once the number of "parts" start rising quickly, e.g. when looking at different words in sentences. qlcMatrix/inst/doc/transformingDataSparse.html 0000644 0001762 0000144 00000037353 13266303707 021331 0 ustar ligges users
The basic functions of this package are almost trivial, but they allow for a highly flexible and efficient transformation of data into sparse matrices. The tree basic functions are ttMatrix
, pwMatrix
, and jMatrix
. This vignette will present a gentle instroduction to these three functions.
The principle of ttMatrix
(“type-token Matrix”) is that a sparse matrix is created from a vector. All unique elements in the vector are listed as rows (“types”), and the elements of the vector itself are listed as columns (“tokens”). The matrix shows which types are found in which position of the original vector. Take a simple character vector with repeating elements as an example, then ttMatrix
will return the row names ($rownames
) by default separately from the matrix ($M
). The column names are of course identical to the input in data
.
data <- c("a", "b", "a", "c", "b", "c", "A")
ttMatrix(data)
## $M
## 4 x 7 sparse Matrix of class "ngCMatrix"
##
## [1,] . . . . . . |
## [2,] | . | . . . .
## [3,] . | . . | . .
## [4,] . . . | . | .
##
## $rownames
## [1] "A" "a" "b" "c"
You can also include the row and column names into the matrix as output by using the option simplify = TRUE
. Note that in many cases of large sparse matrices, the same row or column names will be repeated in different sparse matrices. In such situations it is preferable to store the row or column names separately. Simply duplicating them migth take a lot of space.
Also note that the ordering of the rows (the “types”) is crucial. This ordering is determined by the so-called collation.locale
option, which defaults to “C”" (which means the numerical ordering of the Unicode Standard is used). This ordering is different from the default ordering on most systems, which might internally for example use the collation locale `en_US.UTF-8’. The difference can be discerned in the example below, as the capital letters are ordered after the lowercase letters in the US-english locale, but before in the “C” locale (compare with the output of the previous call).
ttMatrix(data, simplify = TRUE, collation.locale = "en_US.UTF-8")
## 4 x 7 sparse Matrix of class "ngCMatrix"
## a b a c b c A
## a | . | . . . .
## A . . . . . . |
## b . | . . | . .
## c . . . | . | .
The principle of pwMatrix
(“part-whole Matrix”) is so simple that it seems almost trivial. However, in combination with a bit matrix algebra the utility of this simple sparse Matrix is very impressive. First consider a simple vector of strings. Basically, pwMatrix
looks at all parts of the strings (split into parts as determined by the separator sep
). In the example below, The row and column names are added to the matrix by simplify = TRUE
, and a gap has been added between the strings (like treating them as a sentence with boundaries).
strings <- c("this", "is", "that")
(PW <- pwMatrix(strings, sep = "", simplify = TRUE, gap.length = 1, gap.symbol = "_") )
## 12 x 3 sparse Matrix of class "ngCMatrix"
## this is that
## t | . .
## h | . .
## i | . .
## s | . .
## _ . . .
## i . | .
## s . | .
## _ . . .
## t . . |
## h . . |
## a . . |
## t . . |
By itself, this part-whole Matrix is not very interesting. It becomes more interesting when we also consider the type-token Matrix of the “parts” (i.e. rownames) of the part-whole Matrix. The matrix product of these two matrices gives a sparse way to count how often each “part” occurs in each “whole”. This is a quick and sparse way to count parts.
TT <- ttMatrix(rownames(PW), simplify = TRUE)
printSpMatrix(TT, col.names = TRUE)
## t h i s _ i s _ t h a t
## _ . . . . | . . | . . . .
## a . . . . . . . . . . | .
## h . | . . . . . . . | . .
## i . . | . . | . . . . . .
## s . . . | . . | . . . . .
## t | . . . . . . . | . . |
(TT*1) %*% (PW*1)
## 6 x 3 sparse Matrix of class "dgCMatrix"
## this is that
## _ . . .
## a . . 1
## h 1 . 1
## i 1 1 .
## s 1 1 .
## t 1 . 2
Even more amazing is the following trick. Using a “upper-shift” matrix (i.e. a matrix with 1s on the diagonal one above the main diagonal), then the following simple matrix multiplication lists the number of adjacent combinations (i.e. how often is one letter followed by another in the strings? First element in the rows, second in the columns):
S <- bandSparse( n = ncol(TT), k = 1)
(TT*1) %*% (S*1) %*% t(TT*1)
## 6 x 6 sparse Matrix of class "dgCMatrix"
## _ a h i s t
## _ . . . 1 . 1
## a . . . . . 1
## h . 1 . 1 . .
## i . . . . 2 .
## s 2 . . . . .
## t . . 2 . . .
This kind of computations start making a lot of sense once the number of “parts” start rising quickly, e.g. when looking at different words in sentences.