tokenizers/inst/CITATION

citHeader(paste0(
"To cite the tokenizers package in publications, please cite the ",
"paper in the Journal of Open Source Software:"
))
citEntry(
  entry = "Article",
  title = "Fast, Consistent Tokenization of Natural Language Text",
  author = personList(as.person("Lincoln A. Mullen"),
                      as.person("Kenneth Benoit"),
                      as.person("Os Keyes"),
                      as.person("Dmitry Selivanov"),
                      as.person("Jeffrey Arnold")),
  journal = "Journal of Open Source Software",
  year = "2018",
  volume = "3",
  issue = "23",
  pages = "655",
  url = "https://doi.org/10.21105/joss.00655",
  doi = "10.21105/joss.00655",
  textVersion = paste('Lincoln A. Mullen et al.,',
                      '"Fast, Consistent Tokenization of Natural Language',
                      'Text," Journal of Open Source Software 3, no. 23',
                      '(2018): 655, https://doi.org/10.21105/joss.00655.')
)
tokenizers/inst/doc/introduction-to-tokenizers.Rmd

---
title: "Introduction to the tokenizers Package"
author: "Lincoln Mullen"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to the tokenizers Package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Package overview
In natural language processing, tokenization is the process of breaking human-readable text into machine-readable components. The most obvious way to tokenize a text is to split it into words, but there are many other ways as well, the most useful of which are provided by this package.
The tokenizers in this package have a consistent interface. They all take either a character vector of any length, or a list where each element is a character vector of length one. The idea is that each element comprises a text. Then each function returns a list with the same length as the input vector, where each element in the list contains the tokens generated by the function. If the input character vector or list is named, then the names are preserved, so that the names can serve as identifiers.
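As a quick sketch of that interface, a named character vector yields a named list of token vectors (the input here is invented for illustration; if evaluated, the output should look roughly like the comments):

```{r, eval = FALSE}
docs <- c(first = "One fish, two fish.", second = "Red fish, blue fish.")
tokenize_words(docs)
#> $first
#> [1] "one"  "fish" "two"  "fish"
#>
#> $second
#> [1] "red"  "fish" "blue" "fish"
```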
Using the following sample text, the rest of this vignette demonstrates the different kinds of tokenizers in this package.
```{r}
library(tokenizers)
options(max.print = 25)
james <- paste0(
"The question thus becomes a verbal one\n",
"again; and our knowledge of all these early stages of thought and feeling\n",
"is in any case so conjectural and imperfect that farther discussion would\n",
"not be worth while.\n",
"\n",
"Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
"for us _the feelings, acts, and experiences of individual men in their\n",
"solitude, so far as they apprehend themselves to stand in relation to\n",
"whatever they may consider the divine_. Since the relation may be either\n",
"moral, physical, or ritual, it is evident that out of religion in the\n",
"sense in which we take it, theologies, philosophies, and ecclesiastical\n",
"organizations may secondarily grow.\n"
)
```
## Character and character-shingle tokenizers
The character tokenizer splits texts into individual characters.
```{r}
tokenize_characters(james)[[1]]
```
You can also tokenize into character-based shingles.
```{r}
tokenize_character_shingles(james, n = 3, n_min = 3,
strip_non_alphanum = FALSE)[[1]][1:20]
```
## Word and word-stem tokenizers
The word tokenizer splits texts into words.
```{r}
tokenize_words(james)
```
Word stemming is provided by the [SnowballC](https://cran.r-project.org/package=SnowballC) package.
```{r}
tokenize_word_stems(james)
```
You can also provide a vector of stopwords which will be omitted. The [stopwords package](https://github.com/quanteda/stopwords), which contains stopwords for many languages from several sources, is recommended. This argument also works with the n-gram and skip n-gram tokenizers.
```{r}
library(stopwords)
tokenize_words(james, stopwords = stopwords::stopwords("en"))
```
An alternative word tokenizer often used in NLP, which preserves punctuation and separates common English contractions into their parts, is the Penn Treebank tokenizer.
```{r}
tokenize_ptb(james)
```
## N-gram and skip n-gram tokenizers
An n-gram is a contiguous sequence of at least `n_min` and at most `n` words. This function generates all such n-grams, omitting stopwords if desired.
```{r}
tokenize_ngrams(james, n = 5, n_min = 2,
stopwords = stopwords::stopwords("en"))
```
A skip n-gram is like an n-gram in that it takes the `n` and `n_min` parameters. But rather than returning only contiguous sequences of words, it also returns sequences that skip over words, with gaps of between `0` and `k` words. This function generates all such sequences, again omitting stopwords if desired. Note that the number of tokens returned can be very large.
```{r}
tokenize_skip_ngrams(james, n = 5, n_min = 2, k = 2,
stopwords = stopwords::stopwords("en"))
```
## Tweet tokenizer
Tokenizing tweets requires special attention, since usernames (`@whoever`) and hashtags (`#hashtag`) use special characters that might otherwise be stripped away.
```{r}
tokenize_tweets("Welcome, @user, to the tokenizers package. #rstats #forever")
```
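The tweet tokenizer also exposes `strip_punct` and `strip_url` arguments to control whether punctuation and URLs are retained. A minimal sketch with an invented tweet (not an exhaustive tour of the options):

```{r, eval = FALSE}
tweet <- "Loving the #rstats community! See https://example.com @user"
# Keep punctuation as separate tokens, but drop the URL
tokenize_tweets(tweet, strip_punct = FALSE, strip_url = TRUE)
```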
## Sentence and paragraph tokenizers
Sometimes it is desirable to split texts into sentences or paragraphs prior to tokenizing into other forms.
```{r, collapse=FALSE}
tokenize_sentences(james)
tokenize_paragraphs(james)
```
## Text chunking
When one has a very long document, it is sometimes desirable to split it into smaller chunks, each of roughly the same length. This function chunks a document and gives each of the chunks an ID to show their order. These chunks can then be further tokenized.
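The `mobydick` object used below is assumed to hold the full text of *Moby Dick* as a single string; a minimal sketch of building it from a plain-text file (the file path is invented):

```{r, eval = FALSE}
mobydick <- paste(readLines("mobydick.txt"), collapse = "\n")
```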
```{r}
chunks <- chunk_text(mobydick, chunk_size = 100, doc_id = "mobydick")
length(chunks)
chunks[5:6]
tokenize_words(chunks[5:6])
```
## Counting words, characters, sentences
The package also offers functions for counting words, characters, and sentences in a format which works nicely with the rest of the functions.
```{r}
count_words(mobydick)
count_characters(mobydick)
count_sentences(mobydick)
```
tokenizers/inst/doc/tif-and-tokenizers.Rmd

---
title: "The Text Interchange Formats and the tokenizers Package"
author: "Lincoln Mullen"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{The Text Interchange Formats and the tokenizers Package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
The [Text Interchange Formats](https://github.com/ropensci/tif) are a set of standards defined at an [rOpenSci](https://ropensci.org/) sponsored [meeting in London](http://textworkshop17.ropensci.org/) in 2017. The formats allow R text analysis packages to target defined inputs and outputs for corpora, tokens, and document-term matrices. By adhering to these recommendations, R packages can buy into an interoperable ecosystem.
The TIF recommendations are still a draft, but the tokenizers package implements them where applicable: it accepts both of the recommended corpus formats and outputs one of the recommended tokens formats.
Consider these two recommended forms of a corpus. One (`corpus_c`) is a named character vector; the other (`corpus_d`) is a data frame. They both include a document ID and the full text for each item. The data frame format obviously allows for the use of other metadata fields besides the document ID, whereas the other format does not. Using the coercion functions in the tif package, one could switch back and forth between these formats. Tokenizers also supports a corpus formatted as a named list where each element is a character vector of length one (`corpus_l`), though this is not a part of the draft TIF standards.
```{r}
# Named list
(corpus_l <- list(man_comes_around = "There's a man goin' 'round takin' names",
wont_back_down = "Well I won't back down, no I won't back down",
bird_on_a_wire = "Like a bird on a wire"))
# Named character vector
(corpus_c <- unlist(corpus_l))
# Data frame
(corpus_d <- data.frame(doc_id = names(corpus_c), text = unname(corpus_c),
stringsAsFactors = FALSE))
```
All of the tokenizers in this package can accept any of those formats and will return an identical output for each.
```{r}
library(tokenizers)
tokens_l <- tokenize_ngrams(corpus_l, n = 2)
tokens_c <- tokenize_ngrams(corpus_c, n = 2)
tokens_d <- tokenize_ngrams(corpus_d, n = 2)
# Are all these identical?
all(identical(tokens_l, tokens_c),
identical(tokens_c, tokens_d),
identical(tokens_l, tokens_d))
```
The output of all of the tokenizers is a named list, where each element of the list corresponds to a document in the corpus. The names of the list are the document IDs, and the elements are character vectors containing the tokens.
```{r}
tokens_l
```
This format can be coerced to a data frame of document IDs and tokens, one row per token, using the coercion functions in the tif package. That tokens data frame would look like this.
```{r, echo=FALSE}
sample_tokens_df <- structure(list(doc_id = c("man_comes_around", "man_comes_around",
"man_comes_around", "man_comes_around", "man_comes_around", "man_comes_around",
"wont_back_down", "wont_back_down", "wont_back_down", "wont_back_down",
"wont_back_down", "wont_back_down", "wont_back_down", "wont_back_down",
"wont_back_down", "bird_on_a_wire", "bird_on_a_wire", "bird_on_a_wire",
"bird_on_a_wire", "bird_on_a_wire"), token = c("there's a", "a man",
"man goin", "goin round", "round takin", "takin names", "well i",
"i won't", "won't back", "back down", "down no", "no i", "i won't",
"won't back", "back down", "like a", "a bird", "bird on", "on a",
"a wire")), .Names = c("doc_id", "token"), row.names = c(NA,
-20L), class = "data.frame")
head(sample_tokens_df, 10)
```
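If you would rather not depend on the coercion functions, the same tokens data frame can be assembled by hand from the named list returned by the tokenizers (a minimal base-R sketch):

```{r, eval = FALSE}
tokens_df <- data.frame(
  doc_id = rep(names(tokens_l), lengths(tokens_l)), # one row per token
  token = unlist(tokens_l, use.names = FALSE),      # flatten the token vectors
  stringsAsFactors = FALSE
)
head(tokens_df, 10)
```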
tokenizers/tests/testthat.R

library(testthat)
library(tokenizers)
test_check("tokenizers")
tokenizers/tests/testthat/test-encoding.R

context("Encodings")
test_that("Encodings work on Windows", {
input <- "César Moreira Nuñez"
reference <- c("césar", "moreira", "nuñez")
reference_enc <- c("UTF-8", "unknown", "UTF-8")
output_n1 <- tokenize_ngrams(input, n = 1, simplify = TRUE)
output_words <- tokenize_words(input, simplify = TRUE)
output_skip <- tokenize_skip_ngrams(input, n = 1, k = 0, simplify = TRUE)
expect_equal(output_n1, reference)
expect_equal(output_words, reference)
expect_equal(output_skip, reference)
expect_equal(Encoding(output_n1), reference_enc)
expect_equal(Encoding(output_words), reference_enc)
expect_equal(Encoding(output_skip), reference_enc)
})

tokenizers/tests/testthat/test-utils.R

context("Utils")
test_that("Inputs are verified correct", {
expect_silent(check_input(letters))
expect_silent(check_input(list(a = "a", b = "b")))
expect_error(check_input(1:10))
expect_error(check_input(list(a = "a", b = letters)))
expect_error(check_input(list(a = "a", b = 2)))
})
test_that("Stopwords are removed", {
expect_equal(remove_stopwords(letters[1:5], stopwords = c("d", "e")),
letters[1:3])
})

tokenizers/tests/testthat/test-shingles.R

context("Shingle tokenizers")
test_that("Character shingle tokenizer works as expected", {
out_l <- tokenize_character_shingles(docs_l, n = 3, n_min = 2)
out_c <- tokenize_character_shingles(docs_c, n = 3, n_min = 2)
out_1 <- tokenize_character_shingles(docs_c[1], n = 3, n_min = 2,
simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_character_shingles(bad_list))
})
test_that("Character shingle tokenizer produces correct output", {
phrase <- c("Remember who commended thy yellow stockings",
"And wished to see thee cross-gartered.")
names(phrase) <- c("Malvolio 1", "Malvolio 2")
out_d <- tokenize_character_shingles(phrase)
out_asis <- tokenize_character_shingles(phrase, lowercase = FALSE,
strip_non_alphanum = FALSE)
expect_identical(out_d[[1]][1:12], c("rem", "eme", "mem", "emb", "mbe", "ber",
"erw", "rwh", "who", "hoc", "oco", "com"))
expect_identical(out_asis[[2]][1:15], c("And", "nd ", "d w", " wi", "wis",
"ish", "she", "hed", "ed ", "d t",
" to", "to ", "o s", " se", "see"))
})
test_that("Character shingle tokenizer consistently produces NAs where appropriate", {
test <- c("This is a text", NA, "So is this")
names(test) <- letters[1:3]
out <- tokenize_character_shingles(test)
expect_true(is.na(out$b))
})

tokenizers/tests/testthat/moby-ch2.txt

CHAPTER 2. The Carpet-Bag.
I stuffed a shirt or two into my old carpet-bag, tucked it under my arm,
and started for Cape Horn and the Pacific. Quitting the good city of
old Manhatto, I duly arrived in New Bedford. It was a Saturday night in
December. Much was I disappointed upon learning that the little packet
for Nantucket had already sailed, and that no way of reaching that place
would offer, till the following Monday.
As most young candidates for the pains and penalties of whaling stop at
this same New Bedford, thence to embark on their voyage, it may as well
be related that I, for one, had no idea of so doing. For my mind was
made up to sail in no other than a Nantucket craft, because there was a
fine, boisterous something about everything connected with that famous
old island, which amazingly pleased me. Besides though New Bedford has
of late been gradually monopolising the business of whaling, and though
in this matter poor old Nantucket is now much behind her, yet Nantucket
was her great original--the Tyre of this Carthage;--the place where the
first dead American whale was stranded. Where else but from Nantucket
did those aboriginal whalemen, the Red-Men, first sally out in canoes to
give chase to the Leviathan? And where but from Nantucket, too, did that
first adventurous little sloop put forth, partly laden with imported
cobblestones--so goes the story--to throw at the whales, in order to
discover when they were nigh enough to risk a harpoon from the bowsprit?
Now having a night, a day, and still another night following before me
in New Bedford, ere I could embark for my destined port, it became a
matter of concernment where I was to eat and sleep meanwhile. It was a
very dubious-looking, nay, a very dark and dismal night, bitingly cold
and cheerless. I knew no one in the place. With anxious grapnels I had
sounded my pocket, and only brought up a few pieces of silver,--So,
wherever you go, Ishmael, said I to myself, as I stood in the middle of
a dreary street shouldering my bag, and comparing the gloom towards the
north with the darkness towards the south--wherever in your wisdom you
may conclude to lodge for the night, my dear Ishmael, be sure to inquire
the price, and don't be too particular.
With halting steps I paced the streets, and passed the sign of "The
Crossed Harpoons"--but it looked too expensive and jolly there. Further
on, from the bright red windows of the "Sword-Fish Inn," there came such
fervent rays, that it seemed to have melted the packed snow and ice from
before the house, for everywhere else the congealed frost lay ten inches
thick in a hard, asphaltic pavement,--rather weary for me, when I struck
my foot against the flinty projections, because from hard, remorseless
service the soles of my boots were in a most miserable plight. Too
expensive and jolly, again thought I, pausing one moment to watch the
broad glare in the street, and hear the sounds of the tinkling glasses
within. But go on, Ishmael, said I at last; don't you hear? get away
from before the door; your patched boots are stopping the way. So on I
went. I now by instinct followed the streets that took me waterward, for
there, doubtless, were the cheapest, if not the cheeriest inns.
Such dreary streets! blocks of blackness, not houses, on either hand,
and here and there a candle, like a candle moving about in a tomb. At
this hour of the night, of the last day of the week, that quarter of
the town proved all but deserted. But presently I came to a smoky light
proceeding from a low, wide building, the door of which stood invitingly
open. It had a careless look, as if it were meant for the uses of the
public; so, entering, the first thing I did was to stumble over an
ash-box in the porch. Ha! thought I, ha, as the flying particles almost
choked me, are these ashes from that destroyed city, Gomorrah? But "The
Crossed Harpoons," and "The Sword-Fish?"--this, then must needs be the
sign of "The Trap." However, I picked myself up and hearing a loud voice
within, pushed on and opened a second, interior door.
It seemed the great Black Parliament sitting in Tophet. A hundred black
faces turned round in their rows to peer; and beyond, a black Angel
of Doom was beating a book in a pulpit. It was a negro church; and the
preacher's text was about the blackness of darkness, and the weeping and
wailing and teeth-gnashing there. Ha, Ishmael, muttered I, backing out,
Wretched entertainment at the sign of 'The Trap!'
Moving on, I at last came to a dim sort of light not far from the docks,
and heard a forlorn creaking in the air; and looking up, saw a swinging
sign over the door with a white painting upon it, faintly representing
a tall straight jet of misty spray, and these words underneath--"The
Spouter Inn:--Peter Coffin."
Coffin?--Spouter?--Rather ominous in that particular connexion, thought
I. But it is a common name in Nantucket, they say, and I suppose this
Peter here is an emigrant from there. As the light looked so dim, and
the place, for the time, looked quiet enough, and the dilapidated little
wooden house itself looked as if it might have been carted here from
the ruins of some burnt district, and as the swinging sign had a
poverty-stricken sort of creak to it, I thought that here was the very
spot for cheap lodgings, and the best of pea coffee.
It was a queer sort of place--a gable-ended old house, one side palsied
as it were, and leaning over sadly. It stood on a sharp bleak corner,
where that tempestuous wind Euroclydon kept up a worse howling than ever
it did about poor Paul's tossed craft. Euroclydon, nevertheless, is a
mighty pleasant zephyr to any one in-doors, with his feet on the hob
quietly toasting for bed. "In judging of that tempestuous wind called
Euroclydon," says an old writer--of whose works I possess the only copy
extant--"it maketh a marvellous difference, whether thou lookest out at
it from a glass window where the frost is all on the outside, or whether
thou observest it from that sashless window, where the frost is on both
sides, and of which the wight Death is the only glazier." True enough,
thought I, as this passage occurred to my mind--old black-letter, thou
reasonest well. Yes, these eyes are windows, and this body of mine is
the house. What a pity they didn't stop up the chinks and the crannies
though, and thrust in a little lint here and there. But it's too late
to make any improvements now. The universe is finished; the copestone
is on, and the chips were carted off a million years ago. Poor Lazarus
there, chattering his teeth against the curbstone for his pillow, and
shaking off his tatters with his shiverings, he might plug up both ears
with rags, and put a corn-cob into his mouth, and yet that would not
keep out the tempestuous Euroclydon. Euroclydon! says old Dives, in his
red silken wrapper--(he had a redder one afterwards) pooh, pooh! What
a fine frosty night; how Orion glitters; what northern lights! Let them
talk of their oriental summer climes of everlasting conservatories; give
me the privilege of making my own summer with my own coals.
But what thinks Lazarus? Can he warm his blue hands by holding them up
to the grand northern lights? Would not Lazarus rather be in Sumatra
than here? Would he not far rather lay him down lengthwise along the
line of the equator; yea, ye gods! go down to the fiery pit itself, in
order to keep out this frost?
Now, that Lazarus should lie stranded there on the curbstone before the
door of Dives, this is more wonderful than that an iceberg should be
moored to one of the Moluccas. Yet Dives himself, he too lives like a
Czar in an ice palace made of frozen sighs, and being a president of a
temperance society, he only drinks the tepid tears of orphans.
But no more of this blubbering now, we are going a-whaling, and there is
plenty of that yet to come. Let us scrape the ice from our frosted feet,
and see what sort of a place this "Spouter" may be.
tokenizers/tests/testthat/test-wordcount.R

context("Word counts")
test_that("Word counts work on lists and character vectors", {
out_l <- count_sentences(docs_l)
out_c <- count_sentences(docs_c)
expect_identical(out_l, out_c)
out_l <- count_words(docs_l)
out_c <- count_words(docs_c)
expect_identical(out_l, out_c)
out_l <- count_characters(docs_l)
out_c <- count_characters(docs_c)
expect_identical(out_l, out_c)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
})
test_that("Word counts give correct results", {
input <- "This input has 10 words; doesn't it? Well---sure does."
expect_equal(10, count_words(input))
expect_equal(2, count_sentences(input))
expect_equal(nchar(input), count_characters(input))
})
tokenizers/tests/testthat/test-basic.R

context("Basic tokenizers")
test_that("Character tokenizer works as expected", {
out_l <- tokenize_characters(docs_l)
out_c <- tokenize_characters(docs_c)
out_1 <- tokenize_characters(docs_c[1], simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_characters(bad_list))
})
test_that("Character tokenizer produces correct output", {
# skip_on_os("windows")
out_1 <- tokenize_characters(docs_c[1], simplify = TRUE)
expected <- c("c", "h", "a", "p", "t")
expect_identical(head(out_1, 5), expected)
})
test_that("Word tokenizer works as expected", {
out_l <- tokenize_words(docs_l)
out_c <- tokenize_words(docs_c)
out_1 <- tokenize_words(docs_c[1], simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_words(bad_list))
})
test_that("Word tokenizer produces correct output", {
# skip_on_os("windows")
out_1 <- tokenize_words(docs_c[1], simplify = TRUE)
expected <- c("chapter", "1", "loomings", "call", "me")
expect_identical(head(out_1, 5), expected)
})
test_that("Word tokenizer removes stop words", {
test <- "Now is the time for every good person"
test_l <- list(test, test)
stopwords <- c("is", "the", "for")
expected <- c("now", "time", "every", "good", "person")
expected_l <- list(expected, expected)
expect_equal(tokenize_words(test, simplify = TRUE, stopwords = stopwords),
expected)
expect_equal(tokenize_words(test_l, stopwords = stopwords), expected_l)
})
test_that("Word tokenizer can remove punctuation or numbers", {
test_punct <- "This sentence ... has punctuation, doesn't it?"
out_punct <- c("this", "sentence", ".", ".", ".", "has", "punctuation",
",", "doesn't", "it", "?")
test_num <- "In 1968 the GDP was 1.2 trillion."
out_num_f <- c("in", "1968", "the", "gdp", "was", "1.2", "trillion")
out_num_t <- c("in", "the", "gdp", "was", "trillion")
expect_equal(tokenize_words(test_punct, simplify = TRUE, strip_punct = FALSE),
out_punct)
expect_equal(tokenize_words(test_num, simplify = TRUE, strip_numeric = FALSE),
out_num_f)
expect_equal(tokenize_words(test_num, simplify = TRUE, strip_numeric = TRUE),
out_num_t)
})
test_that("Sentence tokenizer works as expected", {
out_l <- tokenize_sentences(docs_l)
out_c <- tokenize_sentences(docs_c)
out_1 <- tokenize_sentences(docs_c[1], simplify = TRUE)
out_1_lc <- tokenize_sentences(docs_c[1], lowercase = TRUE, simplify = TRUE)
out_1_pc <- tokenize_sentences(docs_c[1], strip_punct = TRUE, simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_sentences(bad_list))
})
test_that("Sentence tokenizer produces correct output", {
# skip_on_os("windows")
out_1 <- tokenize_sentences(docs_c[1], simplify = TRUE)
out_1_lc <- tokenize_sentences(docs_c[1], lowercase = TRUE, simplify = TRUE)
out_1_pc <- tokenize_sentences(docs_c[1], strip_punct = TRUE, simplify = TRUE)
expected <- c("CHAPTER 1.", "Loomings.", "Call me Ishmael.")
expected_pc <- c("CHAPTER 1", "Loomings", "Call me Ishmael")
expect_identical(head(out_1, 3), expected)
expect_identical(head(out_1_lc, 3), tolower(expected))
expect_identical(head(out_1_pc, 3), expected_pc)
})
test_that("Line tokenizer works as expected", {
out_l <- tokenize_lines(docs_l)
out_c <- tokenize_lines(docs_c)
out_1 <- tokenize_lines(docs_c[1], simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_lines(bad_list))
})
test_that("Sentence tokenizer produces correct output", {
# skip_on_os("windows")
out_1 <- tokenize_lines(docs_c[1], simplify = TRUE)
expected <- c("CHAPTER 1. Loomings.",
"Call me Ishmael. Some years ago--never mind how long precisely--having")
expect_identical(head(out_1, 2), expected)
})
test_that("Paragraph tokenizer works as expected", {
out_l <- tokenize_paragraphs(docs_l)
out_c <- tokenize_paragraphs(docs_c)
out_1 <- tokenize_paragraphs(docs_c[1], simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_paragraphs(bad_list))
})
test_that("Paragraph tokenizer produces correct output", {
# skip_on_os("windows")
out_1 <- tokenize_paragraphs(docs_c[1], simplify = TRUE)
expected <- c("There now is your insular city of the Manhattoes")
expect_true(grepl(expected, out_1[3]))
})
test_that("Regex tokenizer works as expected", {
out_l <- tokenize_regex(docs_l, pattern = "[[:punct:]\n]")
out_c <- tokenize_regex(docs_c, pattern = "[[:punct:]\n]")
out_1 <- tokenize_regex(docs_c[1], pattern = "[[:punct:]\n]", simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_regex(bad_list))
})
test_that("Regex tokenizer produces correct output", {
# skip_on_os("windows")
out_1 <- tokenize_regex(docs_c[1], pattern = "[[:punct:]\n]", simplify = TRUE)
expected <- c("CHAPTER 1", " Loomings", "Call me Ishmael", " Some years ago",
"never mind how long precisely")
expect_identical(head(out_1, 5), expected)
})

tokenizers/tests/testthat/test-tif.R

context("Text Interchange Format")
test_that("Can detect a TIF compliant data.frame", {
expect_true(is_corpus_df(docs_df))
bad_df <- docs_df
bad_df$doc_id <- NULL
expect_error(is_corpus_df(bad_df))
})
test_that("Can coerce a TIF compliant data.frame to a character vector", {
output <- docs_df$text
names(output) <- docs_df$doc_id
expect_identical(corpus_df_as_corpus_vector(docs_df), output)
})
test_that("Different methods produce identical output", {
expect_identical(tokenize_words(docs_c), tokenize_words(docs_df))
expect_identical(tokenize_words(docs_l), tokenize_words(docs_df))
expect_identical(tokenize_characters(docs_c), tokenize_characters(docs_df))
expect_identical(tokenize_characters(docs_l), tokenize_characters(docs_df))
expect_identical(tokenize_sentences(docs_c), tokenize_sentences(docs_df))
expect_identical(tokenize_sentences(docs_l), tokenize_sentences(docs_df))
expect_identical(tokenize_lines(docs_c), tokenize_lines(docs_df))
expect_identical(tokenize_lines(docs_l), tokenize_lines(docs_df))
expect_identical(tokenize_paragraphs(docs_c), tokenize_paragraphs(docs_df))
expect_identical(tokenize_paragraphs(docs_l), tokenize_paragraphs(docs_df))
expect_identical(tokenize_regex(docs_c), tokenize_regex(docs_df))
expect_identical(tokenize_regex(docs_l), tokenize_regex(docs_df))
expect_identical(tokenize_tweets(docs_c), tokenize_tweets(docs_df))
expect_identical(tokenize_tweets(docs_l), tokenize_tweets(docs_df))
expect_identical(tokenize_ngrams(docs_c), tokenize_ngrams(docs_df))
expect_identical(tokenize_ngrams(docs_l), tokenize_ngrams(docs_df))
expect_identical(tokenize_skip_ngrams(docs_c), tokenize_skip_ngrams(docs_df))
expect_identical(tokenize_skip_ngrams(docs_l), tokenize_skip_ngrams(docs_df))
expect_identical(tokenize_ptb(docs_c), tokenize_ptb(docs_df))
expect_identical(tokenize_ptb(docs_l), tokenize_ptb(docs_df))
expect_identical(tokenize_character_shingles(docs_c),
tokenize_character_shingles(docs_df))
expect_identical(tokenize_character_shingles(docs_l),
tokenize_character_shingles(docs_df))
expect_identical(tokenize_word_stems(docs_c), tokenize_word_stems(docs_df))
expect_identical(tokenize_word_stems(docs_l), tokenize_word_stems(docs_df))
})
tokenizers/tests/testthat/test-ptb.R

context("PTB tokenizer")
test_that("PTB tokenizer works as expected", {
out_l <- tokenize_ptb(docs_l)
out_c <- tokenize_ptb(docs_c)
out_1 <- tokenize_ptb(docs_c[1], simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_ptb(bad_list))
})
test_that("Word tokenizer produces correct output", {
sents <-
c(paste0("Good muffins cost $3.88\nin New York. ",
"Please buy me\\ntwo of them.\\nThanks."),
"They'll save and invest more." ,
"hi, my name can't hello,")
expected <-
list(c("Good", "muffins", "cost", "$", "3.88", "in", "New", "York.",
"Please", "buy", "me\\ntwo", "of", "them.\\nThanks", "."),
c("They", "'ll", "save", "and", "invest", "more", "."),
c("hi", ",", "my", "name", "ca", "n't", "hello", ","))
expect_identical(tokenize_ptb(sents), expected)
expect_identical(tokenize_ptb("This can't work.", lowercase = TRUE, simplify = TRUE),
c("this", "ca", "n't", "work", "."))
})
tokenizers/tests/testthat/test-tokenize_tweets.R

context("Tweet tokenizer")
test_that("tweet tokenizer works correctly with case", {
txt <- c(t1 = "Try this: tokenizers at @rOpenSci https://twitter.com/search?q=ropensci&src=typd",
t2 = "#rstats awesome Package! @rOpenSci",
t3 = "one two three Four #FIVE")
out_tw1 <- tokenize_tweets(txt, lowercase = TRUE)
expect_identical(out_tw1$t2, c("#rstats", "awesome", "package", "@rOpenSci"))
expect_identical(out_tw1$t3, c("one", "two", "three", "four", "#FIVE"))
out_tw2 <- tokenize_tweets(txt, lowercase = FALSE)
expect_identical(out_tw2$t2, c("#rstats", "awesome", "Package", "@rOpenSci"))
expect_identical(out_tw2$t3, c("one", "two", "three", "Four", "#FIVE"))
})
test_that("tweet tokenizer works correctly with strip_punctuation", {
txt <- c(t1 = "Try this: tokenizers at @rOpenSci https://twitter.com/search?q=ropensci&src=typd",
t2 = "#rstats awesome Package! @rOpenSci",
t3 = "one two three Four #FIVE")
out_tw1 <- tokenize_tweets(txt, strip_punct = TRUE, lowercase = TRUE)
expect_identical(out_tw1$t2, c("#rstats", "awesome", "package", "@rOpenSci"))
expect_identical(out_tw1$t3, c("one", "two", "three", "four", "#FIVE"))
out_tw2 <- tokenize_tweets(txt, strip_punct = FALSE, lowercase = TRUE)
expect_identical(
out_tw2$t1,
c("try", "this", ":", "tokenizers", "at", "@rOpenSci", "https://twitter.com/search?q=ropensci&src=typd")
)
})
test_that("tweet tokenizer works correctly with strip_url", {
txt <- c(t1 = "Tokenizers at @rOpenSci https://twitter.com/search?q=ropensci&src=typd")
out_tw1 <- tokenize_tweets(txt, strip_punct = TRUE, strip_url = FALSE)
expect_identical(
out_tw1$t1,
c("tokenizers", "at", "@rOpenSci", "https://twitter.com/search?q=ropensci&src=typd")
)
out_tw2 <- tokenize_tweets(txt, strip_punct = TRUE, strip_url = TRUE)
expect_identical(
out_tw2$t1,
c("tokenizers", "at", "@rOpenSci")
)
})
test_that("names are preserved with tweet tokenizer", {
expect_equal(
names(tokenize_tweets(c(t1 = "Larry, moe, and curly", t2 = "@ropensci #rstats"))),
c("t1", "t2")
)
expect_equal(
names(tokenize_tweets(c("Larry, moe, and curly", "@ropensci #rstats"))),
NULL
)
})
test_that("punctuation as part of tweets can preserved", {
txt <- c(t1 = "We love #rstats!",
t2 = "@rOpenSci: See you at UseR!")
expect_equal(
tokenize_tweets(txt, strip_punct = FALSE, lowercase = FALSE),
list(t1 = c("We", "love", "#rstats", "!"),
t2 = c("@rOpenSci", ":", "See", "you", "at", "UseR", "!"))
)
expect_equal(
tokenize_tweets(txt, strip_punct = TRUE, lowercase = FALSE),
list(t1 = c("We", "love", "#rstats"),
t2 = c("@rOpenSci", "See", "you", "at", "UseR"))
)
})
tokenizers/tests/testthat/test-chunking.R

context("Document chunking")
test_that("Document chunking work on lists and character vectors", {
chunk_size <- 10
out_l <- chunk_text(docs_l, chunk_size = chunk_size)
out_c <- chunk_text(docs_c, chunk_size = chunk_size)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_c[[1]])
expect_identical(out_c[[1]], out_c[[1]])
expect_named(out_l, names(out_c))
expect_named(out_c, names(out_l))
expect_error(chunk_text(bad_list))
})
test_that("Document chunking splits documents apart correctly", {
test_doc <- "This is a sentence with exactly eight words. Here's two. And now here are ten words in a great sentence. And five or six left over."
out <- chunk_text(test_doc, chunk_size = 10, doc_id = "test")
out_wc <- count_words(out)
test_wc <- c(10L, 10L, 6L)
names(test_wc) <- c("test-1", "test-2", "test-3")
expect_named(out, names(test_wc))
expect_identical(out_wc, test_wc)
out_short <- chunk_text("This is a short text")
expect_equal(count_words(out_short[[1]]), 5)
expect_named(out_short, NULL)
})
tokenizers/tests/testthat/moby-ch1.txt

CHAPTER 1. Loomings.
Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
There now is your insular city of the Manhattoes, belted round by
wharves as Indian isles by coral reefs--commerce surrounds it with
her surf. Right and left, the streets take you waterward. Its extreme
downtown is the battery, where that noble mole is washed by waves, and
cooled by breezes, which a few hours previous were out of sight of land.
Look at the crowds of water-gazers there.
Circumambulate the city of a dreamy Sabbath afternoon. Go from Corlears
Hook to Coenties Slip, and from thence, by Whitehall, northward. What
do you see?--Posted like silent sentinels all around the town, stand
thousands upon thousands of mortal men fixed in ocean reveries. Some
leaning against the spiles; some seated upon the pier-heads; some
looking over the bulwarks of ships from China; some high aloft in the
rigging, as if striving to get a still better seaward peep. But these
are all landsmen; of week days pent up in lath and plaster--tied to
counters, nailed to benches, clinched to desks. How then is this? Are
the green fields gone? What do they here?
But look! here come more crowds, pacing straight for the water, and
seemingly bound for a dive. Strange! Nothing will content them but the
extremest limit of the land; loitering under the shady lee of yonder
warehouses will not suffice. No. They must get just as nigh the water
as they possibly can without falling in. And there they stand--miles of
them--leagues. Inlanders all, they come from lanes and alleys, streets
and avenues--north, east, south, and west. Yet here they all unite.
Tell me, does the magnetic virtue of the needles of the compasses of all
those ships attract them thither?
Once more. Say you are in the country; in some high land of lakes. Take
almost any path you please, and ten to one it carries you down in a
dale, and leaves you there by a pool in the stream. There is magic
in it. Let the most absent-minded of men be plunged in his deepest
reveries--stand that man on his legs, set his feet a-going, and he will
infallibly lead you to water, if water there be in all that region.
Should you ever be athirst in the great American desert, try this
experiment, if your caravan happen to be supplied with a metaphysical
professor. Yes, as every one knows, meditation and water are wedded for
ever.
But here is an artist. He desires to paint you the dreamiest, shadiest,
quietest, most enchanting bit of romantic landscape in all the valley of
the Saco. What is the chief element he employs? There stand his trees,
each with a hollow trunk, as if a hermit and a crucifix were within; and
here sleeps his meadow, and there sleep his cattle; and up from yonder
cottage goes a sleepy smoke. Deep into distant woodlands winds a
mazy way, reaching to overlapping spurs of mountains bathed in their
hill-side blue. But though the picture lies thus tranced, and though
this pine-tree shakes down its sighs like leaves upon this shepherd's
head, yet all were vain, unless the shepherd's eye were fixed upon the
magic stream before him. Go visit the Prairies in June, when for scores
on scores of miles you wade knee-deep among Tiger-lilies--what is the
one charm wanting?--Water--there is not a drop of water there! Were
Niagara but a cataract of sand, would you travel your thousand miles to
see it? Why did the poor poet of Tennessee, upon suddenly receiving two
handfuls of silver, deliberate whether to buy him a coat, which he sadly
needed, or invest his money in a pedestrian trip to Rockaway Beach? Why
is almost every robust healthy boy with a robust healthy soul in him, at
some time or other crazy to go to sea? Why upon your first voyage as a
passenger, did you yourself feel such a mystical vibration, when first
told that you and your ship were now out of sight of land? Why did the
old Persians hold the sea holy? Why did the Greeks give it a separate
deity, and own brother of Jove? Surely all this is not without meaning.
And still deeper the meaning of that story of Narcissus, who because
he could not grasp the tormenting, mild image he saw in the fountain,
plunged into it and was drowned. But that same image, we ourselves see
in all rivers and oceans. It is the image of the ungraspable phantom of
life; and this is the key to it all.
Now, when I say that I am in the habit of going to sea whenever I begin
to grow hazy about the eyes, and begin to be over conscious of my lungs,
I do not mean to have it inferred that I ever go to sea as a passenger.
For to go as a passenger you must needs have a purse, and a purse is
but a rag unless you have something in it. Besides, passengers get
sea-sick--grow quarrelsome--don't sleep of nights--do not enjoy
themselves much, as a general thing;--no, I never go as a passenger;
nor, though I am something of a salt, do I ever go to sea as a
Commodore, or a Captain, or a Cook. I abandon the glory and distinction
of such offices to those who like them. For my part, I abominate all
honourable respectable toils, trials, and tribulations of every kind
whatsoever. It is quite as much as I can do to take care of myself,
without taking care of ships, barques, brigs, schooners, and what not.
And as for going as cook,--though I confess there is considerable glory
in that, a cook being a sort of officer on ship-board--yet, somehow,
I never fancied broiling fowls;--though once broiled, judiciously
buttered, and judgmatically salted and peppered, there is no one who
will speak more respectfully, not to say reverentially, of a broiled
fowl than I will. It is out of the idolatrous dotings of the old
Egyptians upon broiled ibis and roasted river horse, that you see the
mummies of those creatures in their huge bake-houses the pyramids.
No, when I go to sea, I go as a simple sailor, right before the mast,
plumb down into the forecastle, aloft there to the royal mast-head.
True, they rather order me about some, and make me jump from spar to
spar, like a grasshopper in a May meadow. And at first, this sort
of thing is unpleasant enough. It touches one's sense of honour,
particularly if you come of an old established family in the land, the
Van Rensselaers, or Randolphs, or Hardicanutes. And more than all,
if just previous to putting your hand into the tar-pot, you have been
lording it as a country schoolmaster, making the tallest boys stand
in awe of you. The transition is a keen one, I assure you, from a
schoolmaster to a sailor, and requires a strong decoction of Seneca and
the Stoics to enable you to grin and bear it. But even this wears off in
time.
What of it, if some old hunks of a sea-captain orders me to get a broom
and sweep down the decks? What does that indignity amount to, weighed,
I mean, in the scales of the New Testament? Do you think the archangel
Gabriel thinks anything the less of me, because I promptly and
respectfully obey that old hunks in that particular instance? Who ain't
a slave? Tell me that. Well, then, however the old sea-captains may
order me about--however they may thump and punch me about, I have the
satisfaction of knowing that it is all right; that everybody else is
one way or other served in much the same way--either in a physical
or metaphysical point of view, that is; and so the universal thump is
passed round, and all hands should rub each other's shoulder-blades, and
be content.
Again, I always go to sea as a sailor, because they make a point of
paying me for my trouble, whereas they never pay passengers a single
penny that I ever heard of. On the contrary, passengers themselves must
pay. And there is all the difference in the world between paying
and being paid. The act of paying is perhaps the most uncomfortable
infliction that the two orchard thieves entailed upon us. But BEING
PAID,--what will compare with it? The urbane activity with which a man
receives money is really marvellous, considering that we so earnestly
believe money to be the root of all earthly ills, and that on no account
can a monied man enter heaven. Ah! how cheerfully we consign ourselves
to perdition!
Finally, I always go to sea as a sailor, because of the wholesome
exercise and pure air of the fore-castle deck. For as in this world,
head winds are far more prevalent than winds from astern (that is,
if you never violate the Pythagorean maxim), so for the most part the
Commodore on the quarter-deck gets his atmosphere at second hand from
the sailors on the forecastle. He thinks he breathes it first; but not
so. In much the same way do the commonalty lead their leaders in many
other things, at the same time that the leaders little suspect it.
But wherefore it was that after having repeatedly smelt the sea as a
merchant sailor, I should now take it into my head to go on a whaling
voyage; this the invisible police officer of the Fates, who has the
constant surveillance of me, and secretly dogs me, and influences me
in some unaccountable way--he can better answer than any one else. And,
doubtless, my going on this whaling voyage, formed part of the grand
programme of Providence that was drawn up a long time ago. It came in as
a sort of brief interlude and solo between more extensive performances.
I take it that this part of the bill must have run something like this:
"GRAND CONTESTED ELECTION FOR THE PRESIDENCY OF THE UNITED STATES.
"WHALING VOYAGE BY ONE ISHMAEL.
"BLOODY BATTLE IN AFFGHANISTAN."
Though I cannot tell why it was exactly that those stage managers, the
Fates, put me down for this shabby part of a whaling voyage, when others
were set down for magnificent parts in high tragedies, and short and
easy parts in genteel comedies, and jolly parts in farces--though
I cannot tell why this was exactly; yet, now that I recall all the
circumstances, I think I can see a little into the springs and motives
which being cunningly presented to me under various disguises, induced
me to set about performing the part I did, besides cajoling me into the
delusion that it was a choice resulting from my own unbiased freewill
and discriminating judgment.
Chief among these motives was the overwhelming idea of the great
whale himself. Such a portentous and mysterious monster roused all my
curiosity. Then the wild and distant seas where he rolled his island
bulk; the undeliverable, nameless perils of the whale; these, with all
the attending marvels of a thousand Patagonian sights and sounds, helped
to sway me to my wish. With other men, perhaps, such things would not
have been inducements; but as for me, I am tormented with an everlasting
itch for things remote. I love to sail forbidden seas, and land on
barbarous coasts. Not ignoring what is good, I am quick to perceive a
horror, and could still be social with it--would they let me--since it
is but well to be on friendly terms with all the inmates of the place
one lodges in.
By reason of these things, then, the whaling voyage was welcome; the
great flood-gates of the wonder-world swung open, and in the wild
conceits that swayed me to my purpose, two and two there floated into
my inmost soul, endless processions of the whale, and, mid most of them
all, one grand hooded phantom, like a snow hill in the air.
tokenizers/tests/testthat/moby-ch3.txt 0000644 0001762 0000144 00000077762 12775200571 017664 0 ustar ligges users CHAPTER 3
The Spouter-Inn
Entering that gable-ended Spouter-Inn, you found yourself
in a wide, low, straggling entry with old-fashioned wainscots,
reminding one of the bulwarks of some condemned old craft.
On one side hung a very large oil painting so thoroughly besmoked,
and every way defaced, that in the unequal crosslights by which
you viewed it, it was only by diligent study and a series of
systematic visits to it, and careful inquiry of the neighbors,
that you could any way arrive at an understanding of its purpose.
Such unaccountable masses of shades and shadows, that at
first you almost thought some ambitious young artist,
in the time of the New England hags, had endeavored to delineate
chaos bewitched. But by dint of much and earnest contemplation,
and oft repeated ponderings, and especially by throwing open
the little window towards the back of the entry, you at last
come to the conclusion that such an idea, however wild,
might not be altogether unwarranted.
But what most puzzled and confounded you was a long, limber, portentous,
black mass of something hovering in the centre of the picture over
three blue, dim, perpendicular lines floating in a nameless yeast.
A boggy, soggy, squitchy picture truly, enough to drive
a nervous man distracted. Yet was there a sort of indefinite,
half-attained, unimaginable sublimity about it that fairly froze
you to it, till you involuntarily took an oath with yourself
to find out what that marvellous painting meant. Ever and anon
a bright, but, alas, deceptive idea would dart you through.--
It's the Black Sea in a midnight gale.--It's the unnatural
combat of the four primal elements.--It's a blasted heath.--
It's a Hyperborean winter scene.--It's the breaking-up of
the icebound stream of Time. But at last all these fancies
yielded to that one portentous something in the picture's midst.
That once found out, and all the rest were plain. But stop;
does it not bear a faint resemblance to a gigantic fish? even
the great leviathan himself?
In fact, the artist's design seemed this: a final theory of my own,
partly based upon the aggregated opinions of many aged persons
with whom I conversed upon the subject. The picture represents
a Cape-Horner in a great hurricane; the half-foundered ship
weltering there with its three dismantled masts alone visible;
and an exasperated whale, purposing to spring clean over the craft,
is in the enormous act of impaling himself upon the three mast-heads.
The opposite wall of this entry was hung all over with a heathenish array
of monstrous clubs and spears. Some were thickly set with glittering
teeth resembling ivory saws; others were tufted with knots of human hair;
and one was sickle-shaped, with a vast handle sweeping round
like the segment made in the new-mown grass by a long-armed mower.
You shuddered as you gazed, and wondered what monstrous cannibal
and savage could ever have gone a death-harvesting with such a hacking,
horrifying implement. Mixed with these were rusty old whaling lances
and harpoons all broken and deformed. Some were storied weapons.
With this once long lance, now wildly elbowed, fifty years ago did
Nathan Swain kill fifteen whales between a sunrise and a sunset.
And that harpoon--so like a corkscrew now--was flung in Javan seas,
and run away with by a whale, years afterwards slain off the Cape
of Blanco. The original iron entered nigh the tail, and, like a restless
needle sojourning in the body of a man, travelled full forty feet,
and at last was found imbedded in the hump.
Crossing this dusky entry, and on through yon low-arched way--
cut through what in old times must have been a great central
chimney with fireplaces all round--you enter the public room.
A still duskier place is this, with such low ponderous
beams above, and such old wrinkled planks beneath, that you
would almost fancy you trod some old craft's cockpits,
especially of such a howling night, when this corner-anchored
old ark rocked so furiously. On one side stood a long, low,
shelf-like table covered with cracked glass cases, filled with
dusty rarities gathered from this wide world's remotest nooks.
Projecting from the further angle of the room stands a
dark-looking den--the bar--a rude attempt at a right whale's head.
Be that how it may, there stands the vast arched bone of the
whale's jaw, so wide, a coach might almost drive beneath it.
Within are shabby shelves, ranged round with old decanters,
bottles, flasks; and in those jaws of swift destruction,
like another cursed Jonah (by which name indeed they called
him), bustles a little withered old man, who, for their money,
dearly sells the sailors deliriums and death.
Abominable are the tumblers into which he pours his poison.
Though true cylinders without--within, the villanous green goggling
glasses deceitfully tapered downwards to a cheating bottom.
Parallel meridians rudely pecked into the glass, surround
these footpads' goblets. Fill to this mark, and your charge is
but a penny; to this a penny more; and so on to the full glass--
the Cape Horn measure, which you may gulp down for a shilling.
Upon entering the place I found a number of young seamen gathered about
a table, examining by a dim light divers specimens of skrimshander.
I sought the landlord, and telling him I desired to be accommodated
with a room, received for answer that his house was full--
not a bed unoccupied. "But avast," he added, tapping his forehead,
"you haint no objections to sharing a harpooneer's blanket, have ye?
I s'pose you are goin' a-whalin', so you'd better get used to that
sort of thing."
I told him that I never liked to sleep two in a bed; that if I
should ever do so, it would depend upon who the harpooneer might be,
and that if he (the landlord) really had no other place for me,
and the harpooneer was not decidedly objectionable, why rather
than wander further about a strange town on so bitter a night,
I would put up with the half of any decent man's blanket.
"I thought so. All right; take a seat. Supper?--you want supper?
Supper'll be ready directly."
I sat down on an old wooden settle, carved all over like a
bench on the Battery. At one end a ruminating tar was still
further adorning it with his jack-knife, stooping over
and diligently working away at the space between his legs.
He was trying his hand at a ship under full sail, but he didn't
make much headway, I thought.
At last some four or five of us were summoned to our
meal in an adjoining room. It was cold as Iceland--
no fire at all--the landlord said he couldn't afford it.
Nothing but two dismal tallow candles, each in a winding sheet.
We were fain to button up our monkey jackets, and hold to our
lips cups of scalding tea with our half frozen fingers.
But the fare was of the most substantial kind--not only meat
and potatoes, but dumplings; good heavens! dumplings for supper!
One young fellow in a green box coat, addressed himself
to these dumplings in a most direful manner.
"My boy," said the landlord, "you'll have the nightmare
to a dead sartainty."
"Landlord," I whispered, "that aint the harpooneer is it?"
"Oh, no," said he, looking a sort of diabolically funny, "the harpooneer
is a dark complexioned chap. He never eats dumplings, he don't--
he eats nothing but steaks, and he likes 'em rare."
"The devil he does," says I. "Where is that harpooneer?
Is he here?"
"He'll be here afore long," was the answer.
I could not help it, but I began to feel suspicious of this
"dark complexioned" harpooneer. At any rate, I made up my
mind that if it so turned out that we should sleep together,
he must undress and get into bed before I did.
Supper over, the company went back to the bar-room, when,
knowing not what else to do with myself, I resolved to spend
the rest of the evening as a looker on.
Presently a rioting noise was heard without. Starting up,
the landlord cried, "That's the Grampus's crew. I seed her reported
in the offing this morning; a three years' voyage, and a full ship.
Hurrah, boys; now we'll have the latest news from the Feegees."
A tramping of sea boots was heard in the entry; the door was flung open,
and in rolled a wild set of mariners enough. Enveloped in their shaggy
watch coats, and with their heads muffled in woollen comforters,
all bedarned and ragged, and their beards stiff with icicles,
they seemed an eruption of bears from Labrador. They had just
landed from their boat, and this was the first house they entered.
No wonder, then, that they made a straight wake for the whale's mouth--
the bar--when the wrinkled little old Jonah, there officiating,
soon poured them out brimmers all round. One complained of a bad
cold in his head, upon which Jonah mixed him a pitch-like potion
of gin and molasses, which he swore was a sovereign cure for all
colds and catarrhs whatsoever, never mind of how long standing,
or whether caught off the coast of Labrador, or on the weather side
of an ice-island.
The liquor soon mounted into their heads, as it generally
does even with the arrantest topers newly landed from sea,
and they began capering about most obstreperously.
I observed, however, that one of them held somewhat aloof,
and though he seemed desirous not to spoil the hilarity of his
shipmates by his own sober face, yet upon the whole he refrained from
making as much noise as the rest. This man interested me at once;
and since the sea-gods had ordained that he should soon become my shipmate
(though but a sleeping partner one, so far as this narrative is
concerned), I will here venture upon a little description of him.
He stood full six feet in height, with noble shoulders, and a chest
like a coffer-dam. I have seldom seen such brawn in a man.
His face was deeply brown and burnt, making his white teeth
dazzling by the contrast; while in the deep shadows of his eyes
floated some reminiscences that did not seem to give him much joy.
His voice at once announced that he was a Southerner, and from his
fine stature, I thought he must be one of those tall mountaineers
from the Alleghanian Ridge in Virginia. When the revelry of his
companions had mounted to its height, this man slipped away unobserved,
and I saw no more of him till he became my comrade on the sea.
In a few minutes, however, he was missed by his shipmates,
and being, it seems, for some reason a huge favorite with them,
they raised a cry of "Bulkington! Bulkington! where's Bulkington?"
and darted out of the house in pursuit of him.
It was now about nine o'clock, and the room seeming almost
supernaturally quiet after these orgies, I began to congratulate
myself upon a little plan that had occurred to me just previous
to the entrance of the seamen.
No man prefers to sleep two in a bed. In fact, you would
a good deal rather not sleep with your own brother. I don't know
how it is, but people like to be private when they are sleeping.
And when it comes to sleeping with an unknown stranger,
in a strange inn, in a strange town, and that stranger
a harpooneer, then your objections indefinitely multiply.
Nor was there any earthly reason why I as a sailor should sleep
two in a bed, more than anybody else; for sailors no more
sleep two in a bed at sea, than bachelor Kings do ashore.
To be sure they all sleep together in one apartment, but you
have your own hammock, and cover yourself with your own blanket,
and sleep in your own skin.
The more I pondered over this harpooneer, the more I abominated
the thought of sleeping with him. It was fair to presume that
being a harpooneer, his linen or woollen, as the case might be,
would not be of the tidiest, certainly none of the finest.
I began to twitch all over. Besides, it was getting late,
and my decent harpooneer ought to be home and going bedwards.
Suppose now, he should tumble in upon me at midnight--
how could I tell from what vile hole he had been coming?
"Landlord! I've changed my mind about that harpooneer.--
I shan't sleep with him. I'll try the bench here."
"Just as you please; I'm sorry I cant spare ye a tablecloth for
a mattress, and it's a plaguy rough board here"--feeling of the knots
and notches. "But wait a bit, Skrimshander; I've got a carpenter's
plane there in the bar--wait, I say, and I'll make ye snug enough."
So saying he procured the plane; and with his old silk handkerchief
first dusting the bench, vigorously set to planing away at my bed,
the while grinning like an ape. The shavings flew right and left;
till at last the plane-iron came bump against an indestructible knot.
The landlord was near spraining his wrist, and I told him for heaven's
sake to quit--the bed was soft enough to suit me, and I did not know
how all the planing in the world could make eider down of a pine plank.
So gathering up the shavings with another grin, and throwing them into
the great stove in the middle of the room, he went about his business,
and left me in a brown study.
I now took the measure of the bench, and found that it was
a foot too short; but that could be mended with a chair.
But it was a foot too narrow, and the other bench in
the room was about four inches higher than the planed one--
so there was no yoking them. I then placed the first bench
lengthwise along the only clear space against the wall,
leaving a little interval between, for my back to settle down in.
But I soon found that there came such a draught of cold air
over me from under the sill of the window, that this plan would
never do at all, especially as another current from the rickety
door met the one from the window, and both together formed
a series of small whirlwinds in the immediate vicinity of the spot
where I had thought to spend the night.
The devil fetch that harpooneer, thought I, but stop,
couldn't I steal a march on him--bolt his door inside, and jump
into his bed, not to be wakened by the most violent knockings?
It seemed no bad idea but upon second thoughts I dismissed it.
For who could tell but what the next morning, so soon as I popped
out of the room, the harpooneer might be standing in the entry,
all ready to knock me down!
Still looking around me again, and seeing no possible chance
of spending a sufferable night unless in some other person's bed,
I began to think that after all I might be cherishing
unwarrantable prejudices against this unknown harpooneer.
Thinks I, I'll wait awhile; he must be dropping in before long.
I'll have a good look at him then, and perhaps we may become
jolly good bedfellows after all--there's no telling.
But though the other boarders kept coming in by ones, twos, and threes,
and going to bed, yet no sign of my harpooneer.
"Landlord! said I, "what sort of a chap is he--does he always
keep such late hours?" It was now hard upon twelve o'clock.
The landlord chuckled again with his lean chuckle, and seemed
to be mightily tickled at something beyond my comprehension.
"No," he answered, "generally he's an early bird--airley to bed
and airley to rise--yea, he's the bird what catches the worm.
But to-night he went out a peddling, you see, and I don't see
what on airth keeps him so late, unless, may be, he can't
sell his head."
"Can't sell his head?--What sort of a bamboozingly story
is this you are telling me?" getting into a towering rage.
"Do you pretend to say, landlord, that this harpooneer is actually
engaged this blessed Saturday night, or rather Sunday morning,
in peddling his head around this town?"
"That's precisely it," said the landlord, "and I told him he couldn't
sell it here, the market's overstocked."
"With what?" shouted I.
"With heads to be sure; ain't there too many heads in the world?"
"I tell you what it is, landlord," said I quite calmly,
"you'd better stop spinning that yarn to me--I'm not green."
"May be not," taking out a stick and whittling a toothpick,
"but I rayther guess you'll be done brown if that ere harpooneer
hears you a slanderin' his head."
"I'll break it for him," said I, now flying into a passion again
at this unaccountable farrago of the landlord's.
"It's broke a'ready," said he.
"Broke," said I--"broke, do you mean?"
"Sartain, and that's the very reason he can't sell it, I guess."
"Landlord," said I, going up to him as cool as Mt. Hecla in a
snowstorm--"landlord, stop whittling. You and I must understand
one another, and that too without delay. I come to your house
and want a bed; you tell me you can only give me half a one;
that the other half belongs to a certain harpooneer.
And about this harpooneer, whom I have not yet seen, you persist
in telling me the most mystifying and exasperating stories tending
to beget in me an uncomfortable feeling towards the man whom you
design for my bedfellow--a sort of connexion, landlord, which is
an intimate and confidential one in the highest degree.
I now demand of you to speak out and tell me who and what this
harpooneer is, and whether I shall be in all respects safe
to spend the night with him. And in the first place, you will
be so good as to unsay that story about selling his head,
which if true I take to be good evidence that this harpooneer
is stark mad, and I've no idea of sleeping with a madman;
and you, sir, you I mean, landlord, you, sir, by trying to induce
me to do so knowingly would thereby render yourself liable
to a criminal prosecution."
"Wall," said the landlord, fetching a long breath, "that's a
purty long sarmon for a chap that rips a little now and then.
But be easy, be easy, this here harpooneer I have been tellin'
you of has just arrived from the south seas, where he bought up
a lot of 'balmed New Zealand heads (great curios, you know),
and he's sold all on 'em but one, and that one he's trying to sell
to-night, cause to-morrow's Sunday, and it would not do to be sellin'
human heads about the streets when folks is goin' to churches.
He wanted to last Sunday, but I stopped him just as he was goin'
out of the door with four heads strung on a string, for all
the airth like a string of inions."
This account cleared up the otherwise unaccountable mystery,
and showed that the landlord, after all, had had no idea of fooling me--
but at the same time what could I think of a harpooneer who stayed
out of a Saturday night clean into the holy Sabbath, engaged in such
a cannibal business as selling the heads of dead idolators?
"Depend upon it, landlord, that harpooneer is a dangerous man."
"He pays reg'lar," was the rejoinder. "But come, it's getting
dreadful late, you had better be turning flukes--it's a nice bed:
Sal and me slept in that ere bed the night we were spliced.
There's plenty of room for two to kick about in that bed;
it's an almighty big bed that. Why, afore we give it up,
Sal used to put our Sam and little Johnny in the foot of it.
But I got a dreaming and sprawling about one night, and somehow,
Sam got pitched on the floor, and came near breaking his arm.
After that, Sal said it wouldn't do. Come along here,
I'll give ye a glim in a jiffy;" and so saying he lighted
a candle and held it towards me, offering to lead the way.
But I stood irresolute; when looking at a clock in the corner,
he exclaimed "I vum it's Sunday--you won't see that harpooneer to-night;
he's come to anchor somewhere--come along then; do come;
won't ye come?"
I considered the matter a moment, and then up stairs we went,
and I was ushered into a small room, cold as a clam, and furnished,
sure enough, with a prodigious bed, almost big enough indeed
for any four harpooneers to sleep abreast.
"There," said the landlord, placing the candle on a crazy old
sea chest that did double duty as a wash-stand and centre table;
"there, make yourself comfortable now; and good night to ye."
I turned round from eyeing the bed, but he had disappeared.
Folding back the counterpane, I stooped over the bed.
Though none of the most elegant, it yet stood the scrutiny
tolerably well. I then glanced round the room; and besides
the bedstead and centre table, could see no other furniture
belonging to the place, but a rude shelf, the four walls,
and a papered fireboard representing a man striking a whale.
Of things not properly belonging to the room, there was a
hammock lashed up, and thrown upon the floor in one corner;
also a large seaman's bag, containing the harpooneer's wardrobe,
no doubt in lieu of a land trunk. Likewise, there was a parcel
of outlandish bone fish hooks on the shelf over the fire-place,
and a tall harpoon standing at the head of the bed.
But what is this on the chest? I took it up, and held it close
to the light, and felt it, and smelt it, and tried every way
possible to arrive at some satisfactory conclusion concerning it.
I can compare it to nothing but a large door mat,
ornamented at the edges with little tinkling tags something
like the stained porcupine quills round an Indian moccasin.
There was a hole or slit in the middle of this mat, as you see
the same in South American ponchos. But could it be possible
that any sober harpooneer would get into a door mat, and parade
the streets of any Christian town in that sort of guise?
I put it on, to try it, and it weighed me down like a hamper,
being uncommonly shaggy and thick, and I thought a little damp,
as though this mysterious harpooneer had been wearing it
of a rainy day. I went up in it to a bit of glass stuck
against the wall, and I never saw such a sight in my life.
I tore myself out of it in such a hurry that I gave myself
a kink in the neck.
I sat down on the side of the bed, and commenced thinking
about this head-peddling harpooneer, and his door mat.
After thinking some time on the bed-side, I got up and took off my
monkey jacket, and then stood in the middle of the room thinking.
I then took off my coat, and thought a little more in my shirt sleeves.
But beginning to feel very cold now, half undressed as I was,
and remembering what the landlord said about the harpooneer's
not coming home at all that night, it being so very late,
I made no more ado, but jumped out of my pantaloons and boots,
and then blowing out the light tumbled into bed, and commended
myself to the care of heaven.
Whether that mattress was stuffed with corncobs or broken crockery,
there is no telling, but I rolled about a good deal, and could
not sleep for a long time. At last I slid off into a light doze,
and had pretty nearly made a good offing towards the land of Nod,
when I heard a heavy footfall in the passage, and saw a glimmer
of light come into the room from under the door.
Lord save me, thinks I, that must be the harpooneer,
the infernal head-peddler. But I lay perfectly still,
and resolved not to say a word till spoken to. Holding a light
in one hand, and that identical New Zealand head in the other,
the stranger entered the room, and without looking towards
the bed, placed his candle a good way off from me on the floor
in one corner, and then began working away at the knotted cords
of the large bag I before spoke of as being in the room.
I was all eagerness to see his face, but he kept it averted
for some time while employed in unlacing the bag's mouth.
This accomplished, however, he turned round--when, good heavens;
what a sight! Such a face! It was of a dark, purplish, yellow color,
here and there stuck over with large blackish looking squares.
Yes, it's just as I thought, he's a terrible bedfellow;
he's been in a fight, got dreadfully cut, and here he is,
just from the surgeon. But at that moment he chanced to turn
his face so towards the light, that I plainly saw they could not
be sticking-plasters at all, those black squares on his cheeks.
They were stains of some sort or other. At first I knew not what
to make of this; but soon an inkling of the truth occurred to me.
I remembered a story of a white man--a whaleman too--
who, falling among the cannibals, had been tattooed by them.
I concluded that this harpooneer, in the course of his
distant voyages, must have met with a similar adventure.
And what is it, thought I, after all! It's only his outside;
a man can be honest in any sort of skin. But then, what to make of
his unearthly complexion, that part of it, I mean, lying round about,
and completely independent of the squares of tattooing.
To be sure, it might be nothing but a good coat of tropical tanning;
but I never heard of a hot sun's tanning a white man into a
purplish yellow one. However, I had never been in the South Seas;
and perhaps the sun there produced these extraordinary effects
upon the skin. Now, while all these ideas were passing
through me like lightning, this harpooneer never noticed me
at all. But, after some difficulty having opened his bag,
he commenced fumbling in it, and presently pulled out a sort
of tomahawk, and a seal-skin wallet with the hair on.
Placing these on the old chest in the middle of the room,
he then took the New Zealand head--a ghastly thing enough--
and crammed it down into the bag. He now took off his hat--
a new beaver hat--when I came nigh singing out with fresh surprise.
There was no hair on his head--none to speak of at least--
nothing but a small scalp-knot twisted up on his forehead. His bald
purplish head now looked for all the world like a mildewed skull.
Had not the stranger stood between me and the door, I would
have bolted out of it quicker than ever I bolted a dinner.
Even as it was, I thought something of slipping out of
the window, but it was the second floor back. I am no coward,
but what to make of this headpeddling purple rascal altogether
passed my comprehension. Ignorance is the parent of fear,
and being completely nonplussed and confounded about the stranger,
I confess I was now as much afraid of him as if it was the devil
himself who had thus broken into my room at the dead of night.
In fact, I was so afraid of him that I was not game enough
just then to address him, and demand a satisfactory answer
concerning what seemed inexplicable in him.
Meanwhile, he continued the business of undressing, and at
last showed his chest and arms. As I live, these covered
parts of him were checkered with the same squares as his face,
his back, too, was all over the same dark squares;
he seemed to have been in a Thirty Years' War, and just
escaped from it with a sticking-plaster shirt.
Still more, his very legs were marked, as if a parcel of dark
green frogs were running up the trunks of young palms.
It was now quite plain that he must be some abominable savage
or other shipped aboard of a whaleman in the South Seas,
and so landed in this Christian country. I quaked to think of it.
A peddler of heads too--perhaps the heads of his own brothers.
He might take a fancy to mine--heavens! look at that tomahawk!
But there was no time for shuddering, for now the savage went
about something that completely fascinated my attention,
and convinced me that he must indeed be a heathen.
Going to his heavy grego, or wrapall, or dreadnaught,
which he had previously hung on a chair, he fumbled in the pockets,
and produced at length a curious little deformed image with a hunch
on its back, and exactly the color of a three days' old Congo baby.
Remembering the embalmed head, at first I almost thought that this
black manikin was a real baby preserved in some similar manner.
But seeing that it was not at all limber, and that it glistened
a good deal like polished ebony, I concluded that it must
be nothing but a wooden idol, which indeed it proved to be.
For now the savage goes up to the empty fire-place,
and removing the papered fire-board, sets up this little
hunch-backed image, like a tenpin, between the andirons.
The chimney jambs and all the bricks inside were very sooty,
so that I thought this fire-place made a very appropriate little
shrine or chapel for his Congo idol.
I now screwed my eyes hard towards the half hidden image,
feeling but ill at ease meantime--to see what was next to follow.
First he takes about a double handful of shavings out of his grego pocket,
and places them carefully before the idol; then laying a bit of ship
biscuit on top and applying the flame from the lamp, he kindled
the shavings into a sacrificial blaze. Presently, after many hasty
snatches into the fire, and still hastier withdrawals of his fingers
(whereby he seemed to be scorching them badly), he at last succeeded
in drawing out the biscuit; then blowing off the heat and ashes
a little, he made a polite offer of it to the little negro.
But the little devil did not seem to fancy such dry sort of fare at all;
he never moved his lips. All these strange antics were accompanied
by still stranger guttural noises from the devotee, who seemed to be
praying in a sing-song or else singing some pagan psalmody or other,
during which his face twitched about in the most unnatural manner.
At last extinguishing the fire, he took the idol up very unceremoniously,
and bagged it again in his grego pocket as carelessly as if he were
a sportsman bagging a dead woodcock.
All these queer proceedings increased my uncomfortableness,
and seeing him now exhibiting strong symptoms of concluding
his business operations, and jumping into bed with me, I thought
it was high time, now or never, before the light was put out,
to break the spell in which I had so long been bound.
But the interval I spent in deliberating what to say, was a fatal one.
Taking up his tomahawk from the table, he examined the head of it
for an instant, and then holding it to the light, with his mouth
at the handle, he puffed out great clouds of tobacco smoke.
The next moment the light was extinguished, and this wild cannibal,
tomahawk between his teeth, sprang into bed with me. I sang out,
I could not help it now; and giving a sudden grunt of astonishment
he began feeling me.
Stammering out something, I knew not what, I rolled away from him
against the wall, and then conjured him, whoever or whatever he might be,
to keep quiet, and let me get up and light the lamp again.
But his guttural responses satisfied me at once that he but ill
comprehended my meaning.
"Who-e debel you?"--he at last said--"you no speak-e, dam-me, I kill-e."
And so saying the lighted tomahawk began flourishing about me in the dark.
"Landlord, for God's sake, Peter Coffin!" shouted
I. "Landlord! Watch! Coffin! Angels! save me!"
"Speak-e! tell-ee me who-ee be, or dam-me, I kill-e!" again growled
the cannibal, while his horrid flourishings of the tomahawk scattered
the hot tobacco ashes about me till I thought my linen would get on fire.
But thank heaven, at that moment the landlord came into the room light
in hand, and leaping from the bed I ran up to him.
"Don't be afraid now," said he, grinning again, "Queequeg here wouldn't
harm a hair of your head."
"Stop your grinning," shouted I, "and why didn't you tell me
that that infernal harpooneer was a cannibal?"
"I thought ye know'd it;--didn't I tell ye, he was a peddlin'
heads around town?--but turn flukes again and go to sleep.
Queequeg, look here--you sabbee me, I sabbee--you this man
sleepe you--you sabbee?"
"Me sabbee plenty"--grunted Queequeg, puffing away at his pipe
and sitting up in bed.
"You gettee in," he added, motioning to me with his tomahawk,
and throwing the clothes to one side. He really did this
in not only a civil but a really kind and charitable way.
I stood looking at him a moment. For all his tattooings
he was on the whole a clean, comely looking cannibal.
What's all this fuss I have been making about, thought I
to myself--the man's a human being just as I am: he has just
as much reason to fear me, as I have to be afraid of him.
Better sleep with a sober cannibal than a drunken Christian.
"Landlord," said I, "tell him to stash his tomahawk there, or pipe,
or whatever you call it; tell him to stop smoking, in short, and I will
turn in with him. But I don't fancy having a man smoking in bed with me.
It's dangerous. Besides, I ain't insured."
This being told to Queequeg, he at once complied, and again politely
motioned me to get into bed--rolling over to one side as much as to say--
"I won't touch a leg of ye."
"Good night, landlord," said I, "you may go."
I turned in, and never slept better in my life.
tokenizers/tests/testthat/test-ngrams.R 0000644 0001762 0000144 00000010502 13252224016 020033 0 ustar ligges users context("N-gram tokenizers")
test_that("Shingled n-gram tokenizer works as expected", {
stopwords <- c("chapter", "me")
out_l <- tokenize_ngrams(docs_l, n = 3, n_min = 2, stopwords = stopwords)
out_c <- tokenize_ngrams(docs_c, n = 3, n_min = 2, stopwords = stopwords)
out_1 <- tokenize_ngrams(docs_c[1], n = 3, n_min = 2, stopwords = stopwords,
simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
# test for https://github.com/lmullen/tokenizers/issues/14
expect_identical(tokenize_ngrams("one two three", n = 3, n_min = 2),
tokenize_ngrams("one two three", n = 5, n_min = 2))
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_ngrams(bad_list))
})
test_that("Shingled n-gram tokenizer produces correct output", {
# skip_on_os("windows")
stopwords <- c("chapter", "me")
out_1 <- tokenize_ngrams(docs_c[1], n = 3, n_min = 2, stopwords = stopwords,
simplify = TRUE)
expected <- c("1 loomings", "1 loomings call", "loomings call",
"loomings call ishmael", "call ishmael", "call ishmael some")
expect_identical(head(out_1, 6), expected)
})
test_that("Shingled n-gram tokenizer consistently produces NAs where appropriate", {
test <- c("This is a text", NA, "So is this")
names(test) <- letters[1:3]
out <- tokenize_ngrams(test)
expect_true(is.na(out$b))
})
test_that("Skip n-gram tokenizer consistently produces NAs where appropriate", {
test <- c("This is a text", NA, "So is this")
names(test) <- letters[1:3]
out <- tokenize_skip_ngrams(test)
expect_true(is.na(out$b))
})
test_that("Skip n-gram tokenizer can use stopwords", {
test <- c("This is a text", "So is this")
names(test) <- letters[1:2]
out <- tokenize_skip_ngrams(test, stopwords = "is", n = 2, n_min = 2)
expect_equal(length(out$a), 3)
expect_identical(out$a[1], "this a")
})
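# Note (added for clarity): check_width() is internal (it is not exported in
# NAMESPACE). Judging from the cases below, a skip pattern is accepted only
# when consecutive positions are at most k + 1 apart, i.e. no more than k
# words are skipped between adjacent tokens.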
test_that("Skips with values greater than k are refused", {
expect_false(check_width(c(0, 4, 5), k = 2))
expect_true(check_width(c(0, 3, 5), k = 2))
expect_false(check_width(c(0, 1, 3), k = 0))
expect_true(check_width(c(0, 1, 2), k = 0))
expect_false(check_width(c(0, 10, 11, 12), k = 5))
expect_true(check_width(c(0, 6, 11, 16, 18), k = 5))
})
test_that("Combinations for skip grams are correct", {
skip_pos <- get_valid_skips(2, 2)
expect_is(skip_pos, "list")
expect_length(skip_pos, 3)
expect_identical(skip_pos, list(c(0, 1), c(0, 2), c(0, 3)))
skip_pos2 <- get_valid_skips(3, 2)
expect_identical(skip_pos2, list(
c(0, 1, 2),
c(0, 1, 3),
c(0, 1, 4),
c(0, 2, 3),
c(0, 2, 4),
c(0, 2, 5),
c(0, 3, 4),
c(0, 3, 5),
c(0, 3, 6)))
})
test_that("Skip n-gram tokenizer works as expected", {
stopwords <- c("chapter", "me")
out_l <- tokenize_skip_ngrams(docs_l, n = 3, k = 2)
out_c <- tokenize_skip_ngrams(docs_c, n = 3, k = 2)
out_1 <- tokenize_skip_ngrams(docs_c[1], n = 3, k = 2, simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_skip_ngrams(bad_list))
})
test_that("Skip n-gram tokenizer produces correct output", {
out_n2_k2 <- tokenize_skip_ngrams(input, n = 2, n_min = 2, k = 2, simplify = TRUE)
expect_equal(sort(skip2_bigrams), sort(out_n2_k2))
out_n3_k2 <- tokenize_skip_ngrams(input, n = 3, n_min = 3, k = 2, simplify = TRUE)
expect_equal(sort(skip2_trigrams), sort(out_n3_k2))
})
test_that("Skip n-gram tokenizers respects stopwords", {
out_1 <- tokenize_skip_ngrams("This is a sentence that is for the test.",
n = 3, k = 2, stopwords = c("a", "the"),
simplify = TRUE)
expect_equal(length(grep("the", out_1)), 0)
})
test_that("Skip n-gram tokenizer warns about large combinations", {
expect_warning(get_valid_skips(n = 7, k = 2), "Input n and k will")
})
tokenizers/tests/testthat/test-stem.R 0000644 0001762 0000144 00000001523 12775200571 017530 0 ustar ligges users context("Stem tokenizers")
test_that("Word stem tokenizer works as expected", {
out_l <- tokenize_word_stems(docs_l)
out_c <- tokenize_word_stems(docs_c)
out_1 <- tokenize_word_stems(docs_c[1], simplify = TRUE)
expect_is(out_l, "list")
expect_is(out_l[[1]], "character")
expect_is(out_c, "list")
expect_is(out_c[[1]], "character")
expect_is(out_1, "character")
expect_identical(out_l, out_c)
expect_identical(out_l[[1]], out_1)
expect_identical(out_c[[1]], out_1)
expect_named(out_l, names(docs_l))
expect_named(out_c, names(docs_c))
expect_error(tokenize_word_stems(bad_list))
})
test_that("Stem tokenizer produces correct output", {
# skip_on_os("windows")
out_1 <- tokenize_word_stems(docs_c[1], simplify = TRUE)
expected <- c("in", "my", "purs", "and", "noth")
expect_identical(out_1[20:24], expected)
})
tokenizers/tests/testthat/helper-data.R 0000644 0001762 0000144 00000002626 13256545214 017777 0 ustar ligges users paths <- list.files(".", pattern = "\\.txt$", full.names = TRUE)
docs_full <- lapply(paths, readLines, encoding = "UTF-8")
docs_l <- lapply(docs_full, paste, collapse = "\n")
# docs_l <- lapply(docs_full, enc2utf8)
docs_c <- unlist(docs_l)
names(docs_l) <- basename(paths)
names(docs_c) <- basename(paths)
docs_df <- data.frame(doc_id = names(docs_c),
text = unname(docs_c),
stringsAsFactors = FALSE)
bad_list <- list(a = paste(letters, collapse = " "), b = letters)
# Using this sample sentence only because it comes from the paper where
# skip n-grams are defined. Not my favorite sentence.
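# Under that definition, a skip-2 bigram pairs a word with any of the next
# three words (at most two intervening words are skipped); the expected
# outputs below spell this out for the sample sentence.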
input <- "Insurgents killed in ongoing fighting."
bigrams <- c("insurgents killed", "killed in", "in ongoing", "ongoing fighting")
skip2_bigrams <- c("insurgents killed", "insurgents in", "insurgents ongoing",
"killed in", "killed ongoing", "killed fighting",
"in ongoing", "in fighting", "ongoing fighting")
trigrams <- c("insurgents killed in", "killed in ongoing", "in ongoing fighting")
skip2_trigrams <- c("insurgents killed in", "insurgents killed ongoing",
"insurgents killed fighting", "insurgents in ongoing",
"insurgents in fighting", "insurgents ongoing fighting",
"killed in ongoing", "killed in fighting",
"killed ongoing fighting", "in ongoing fighting")
tokenizers/src/ 0000755 0001762 0000144 00000000000 13257220650 013240 5 ustar ligges users tokenizers/src/skip_ngrams.cpp 0000644 0001762 0000144 00000004364 13257220650 016270 0 ustar ligges users #include
using namespace Rcpp;
CharacterVector skip_ngrams(CharacterVector words,
ListOf<IntegerVector>& skips,
std::set<std::string>& stopwords) {
std::deque < std::string > checked_words;
std::string str_holding;
// Eliminate stopwords
for(unsigned int i = 0; i < words.size(); i++){
if(words[i] != NA_STRING){
str_holding = as<std::string>(words[i]);
if(stopwords.find(str_holding) == stopwords.end()){
checked_words.push_back(str_holding);
}
}
}
str_holding.clear();
std::deque < std::string > holding;
unsigned int checked_size = checked_words.size();
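// Apply every skip pattern at every starting position, keeping only those
// windows that still fit inside the input. Illustrative walk-through (added
// for clarity): with checked_words = {"a","b","c","d"} and skips =
// {(0,1), (0,2)}, the loops emit "a b", "a c", "b c", "b d", "c d"; a
// pattern is dropped once its last index would run past the end.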
for(unsigned int w = 0; w < checked_size; w++) {
for(unsigned int i = 0; i < skips.size(); i++){
unsigned int in_size = skips[i].size();
if(skips[i][in_size-1] + w < checked_size){
for(unsigned int j = 0; j < skips[i].size(); j++){
str_holding += " " + checked_words[skips[i][j] + w];
}
if(str_holding.size()){
str_holding.erase(0,1);
}
holding.push_back(str_holding);
str_holding.clear();
}
}
}
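// No skip n-grams could be formed (e.g. all words were stopwords or the
// document was too short): return a single NA, consistent with the behavior
// described in NEWS for issue #33.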
if(!holding.size()){
return CharacterVector(1,NA_STRING);
}
CharacterVector output(holding.size());
for(unsigned int i = 0; i < holding.size(); i++){
if(holding[i].size()){
output[i] = String(holding[i], CE_UTF8);
} else {
output[i] = NA_STRING;
}
}
return output;
}
//[[Rcpp::export]]
ListOf<CharacterVector> skip_ngrams_vectorised(ListOf<CharacterVector> words,
ListOf<IntegerVector> skips,
CharacterVector stopwords){
// Create output object and set up for further work
unsigned int input_size = words.size();
List output(input_size);
// Create stopwords set
std::set < std::string > checked_stopwords;
for(unsigned int i = 0; i < stopwords.size(); i++){
if(stopwords[i] != NA_STRING){
checked_stopwords.insert(as<std::string>(stopwords[i]));
}
}
for(unsigned int i = 0; i < input_size; i++){
if(i % 10000 == 0){
Rcpp::checkUserInterrupt();
}
output[i] = skip_ngrams(words[i], skips, checked_stopwords);
}
return output;
}
tokenizers/src/shingle_ngrams.cpp 0000644 0001762 0000144 00000007110 13257220650 016743 0 ustar ligges users #include <Rcpp.h>
using namespace Rcpp;
// calculates size of the ngram vector
inline size_t get_ngram_seq_len(int input_len, int ngram_min, int ngram_max) {
int out_ngram_len_adjust = 0;
for (size_t i = ngram_min - 1; i < ngram_max; i++)
out_ngram_len_adjust += i;
if(input_len < ngram_min)
return 0;
else
return input_len * (ngram_max - ngram_min + 1) - out_ngram_len_adjust;
}
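// Worked example (added for clarity): for input_len = 4, ngram_min = 1 and
// ngram_max = 3, the adjustment is 0 + 1 + 2 = 3, so the result is
// 4 * (3 - 1 + 1) - 3 = 9 -- matching the nine 1:3-grams enumerated in the
// comment inside generate_ngrams_internal() below.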
CharacterVector generate_ngrams_internal(const CharacterVector terms_raw,
const int ngram_min,
const int ngram_max,
std::set<std::string> &stopwords,
// pass buffer by reference to avoid memory allocation
// on each iteration
std::deque<std::string> &terms_filtered_buffer,
const std::string ngram_delim) {
// clear buffer from previous iteration result
terms_filtered_buffer.clear();
std::string term;
// filter out stopwords
for (size_t i = 0; i < terms_raw.size(); i++) {
term = as<std::string>(terms_raw[i]);
if(stopwords.find(term) == stopwords.end())
terms_filtered_buffer.push_back(term);
}
int len = terms_filtered_buffer.size();
size_t ngram_out_len = get_ngram_seq_len(len, ngram_min, std::min(ngram_max, len));
CharacterVector result(ngram_out_len);
std::string k_gram;
size_t k, i = 0, j_max_observed;
// iterates through input vector by window of size = n_max and build n-grams
// for terms ["a", "b", "c", "d"] and n_min = 1, n_max = 3
// will build 1:3-grams in following order
//"a" "a_b" "a_b_c" "b" "b_c" "b_c_d" "c" "c_d" "d"
for(size_t j = 0; j < len; j++ ) {
k = 1;
j_max_observed = j;
while (k <= ngram_max && j_max_observed < len) {
if( k == 1) {
k_gram = terms_filtered_buffer[j_max_observed];
} else {
k_gram = k_gram + ngram_delim + terms_filtered_buffer[j_max_observed];
}
if(k >= ngram_min) {
result[i] = String(k_gram, CE_UTF8);
i++;
}
j_max_observed = j + k;
k = k + 1;
}
}
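// If no n-grams could be generated (e.g. every term was a stopword), return
// a single NA, consistent with the behavior described in NEWS for issue #33.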
if(!result.size()){
result.push_back(NA_STRING);
}
return result;
}
// [[Rcpp::export]]
ListOf<CharacterVector> generate_ngrams_batch(const ListOf<CharacterVector> documents_list,
const int ngram_min,
const int ngram_max,
CharacterVector stopwords = CharacterVector(),
const String ngram_delim = " ") {
std::deque<std::string> terms_filtered_buffer;
const std::string std_string_delim = ngram_delim.get_cstring();
size_t n_docs = documents_list.size();
List result(n_docs);
CharacterVector terms;
std::set<std::string> stopwords_set;
for(size_t i = 0; i < stopwords.size(); i++){
if(stopwords[i] != NA_STRING){
stopwords_set.insert(as<std::string>(stopwords[i]));
}
}
for (size_t i_document = 0; i_document < n_docs; i_document++) {
if(i_document % 10000 == 0){
Rcpp::checkUserInterrupt();
}
terms = documents_list[i_document];
result[i_document] = generate_ngrams_internal(documents_list[i_document],
ngram_min, ngram_max,
stopwords_set,
terms_filtered_buffer,
std_string_delim);
}
return result;
}
tokenizers/src/RcppExports.cpp 0000644 0001762 0000144 00000004477 13257220650 016251 0 ustar ligges users // Generated by using Rcpp::compileAttributes() -> do not edit by hand
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
#include <Rcpp.h>
using namespace Rcpp;
// generate_ngrams_batch
ListOf<CharacterVector> generate_ngrams_batch(const ListOf<CharacterVector> documents_list, const int ngram_min, const int ngram_max, CharacterVector stopwords, const String ngram_delim);
RcppExport SEXP _tokenizers_generate_ngrams_batch(SEXP documents_listSEXP, SEXP ngram_minSEXP, SEXP ngram_maxSEXP, SEXP stopwordsSEXP, SEXP ngram_delimSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< const ListOf<CharacterVector> >::type documents_list(documents_listSEXP);
Rcpp::traits::input_parameter< const int >::type ngram_min(ngram_minSEXP);
Rcpp::traits::input_parameter< const int >::type ngram_max(ngram_maxSEXP);
Rcpp::traits::input_parameter< CharacterVector >::type stopwords(stopwordsSEXP);
Rcpp::traits::input_parameter< const String >::type ngram_delim(ngram_delimSEXP);
rcpp_result_gen = Rcpp::wrap(generate_ngrams_batch(documents_list, ngram_min, ngram_max, stopwords, ngram_delim));
return rcpp_result_gen;
END_RCPP
}
// skip_ngrams_vectorised
ListOf<CharacterVector> skip_ngrams_vectorised(ListOf<CharacterVector> words, ListOf<IntegerVector> skips, CharacterVector stopwords);
RcppExport SEXP _tokenizers_skip_ngrams_vectorised(SEXP wordsSEXP, SEXP skipsSEXP, SEXP stopwordsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Rcpp::RNGScope rcpp_rngScope_gen;
Rcpp::traits::input_parameter< ListOf<CharacterVector> >::type words(wordsSEXP);
Rcpp::traits::input_parameter< ListOf<IntegerVector> >::type skips(skipsSEXP);
Rcpp::traits::input_parameter< CharacterVector >::type stopwords(stopwordsSEXP);
rcpp_result_gen = Rcpp::wrap(skip_ngrams_vectorised(words, skips, stopwords));
return rcpp_result_gen;
END_RCPP
}
static const R_CallMethodDef CallEntries[] = {
{"_tokenizers_generate_ngrams_batch", (DL_FUNC) &_tokenizers_generate_ngrams_batch, 5},
{"_tokenizers_skip_ngrams_vectorised", (DL_FUNC) &_tokenizers_skip_ngrams_vectorised, 3},
{NULL, NULL, 0}
};
RcppExport void R_init_tokenizers(DllInfo *dll) {
R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
R_useDynamicSymbols(dll, FALSE);
}
tokenizers/NAMESPACE 0000644 0001762 0000144 00000003747 13256545214 013710 0 ustar ligges users # Generated by roxygen2: do not edit by hand
S3method(tokenize_character_shingles,data.frame)
S3method(tokenize_character_shingles,default)
S3method(tokenize_characters,data.frame)
S3method(tokenize_characters,default)
S3method(tokenize_lines,data.frame)
S3method(tokenize_lines,default)
S3method(tokenize_ngrams,data.frame)
S3method(tokenize_ngrams,default)
S3method(tokenize_paragraphs,data.frame)
S3method(tokenize_paragraphs,default)
S3method(tokenize_ptb,data.frame)
S3method(tokenize_ptb,default)
S3method(tokenize_regex,data.frame)
S3method(tokenize_regex,default)
S3method(tokenize_sentences,data.frame)
S3method(tokenize_sentences,default)
S3method(tokenize_skip_ngrams,data.frame)
S3method(tokenize_skip_ngrams,default)
S3method(tokenize_tweets,data.frame)
S3method(tokenize_tweets,default)
S3method(tokenize_word_stems,data.frame)
S3method(tokenize_word_stems,default)
S3method(tokenize_words,data.frame)
S3method(tokenize_words,default)
export(chunk_text)
export(count_characters)
export(count_sentences)
export(count_words)
export(tokenize_character_shingles)
export(tokenize_characters)
export(tokenize_lines)
export(tokenize_ngrams)
export(tokenize_paragraphs)
export(tokenize_ptb)
export(tokenize_regex)
export(tokenize_sentences)
export(tokenize_skip_ngrams)
export(tokenize_tweets)
export(tokenize_word_stems)
export(tokenize_words)
importFrom(Rcpp,sourceCpp)
importFrom(SnowballC,getStemLanguages)
importFrom(SnowballC,wordStem)
importFrom(stringi,stri_c)
importFrom(stringi,stri_detect_regex)
importFrom(stringi,stri_opts_regex)
importFrom(stringi,stri_replace_all_charclass)
importFrom(stringi,stri_replace_all_regex)
importFrom(stringi,stri_split_boundaries)
importFrom(stringi,stri_split_charclass)
importFrom(stringi,stri_split_fixed)
importFrom(stringi,stri_split_lines)
importFrom(stringi,stri_split_regex)
importFrom(stringi,stri_sub)
importFrom(stringi,stri_subset_charclass)
importFrom(stringi,stri_trans_tolower)
importFrom(stringi,stri_trim_both)
useDynLib(tokenizers, .registration = TRUE)
tokenizers/NEWS.md 0000644 0001762 0000144 00000004451 13257217764 013567 0 ustar ligges users # tokenizers 0.2.1
- Add citation information for the JOSS paper.
# tokenizers 0.2.0
## Features
- Add the `tokenize_ptb()` function for Penn Treebank tokenizations (@jrnold) (#12).
- Add a function `chunk_text()` to split long documents into pieces (#30).
- New functions to count words, characters, and sentences without tokenization (#36).
- New function `tokenize_tweets()` preserves usernames, hashtags, and URLs (@kbenoit) (#44).
- The `stopwords()` function has been removed in favor of using the **stopwords** package (#46).
- The package now complies with the basic recommendations of the **Text Interchange Format**. All tokenization functions are now methods. This enables them to take corpus inputs as either TIF-compliant named character vectors, named lists, or data frames. All outputs are still named lists of tokens, but these can be easily coerced to data frames of tokens using the `tif` package; see the sketch after this list. (#49)
- Add a new vignette "The Text Interchange Formats and the tokenizers Package" (#49).
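As a minimal sketch of the corpus inputs described above (the object names here are illustrative, not from the package; the `doc_id`/`text` column layout follows the TIF corpus data frame convention):

```r
library(tokenizers)
# A TIF corpus as a named character vector
corpus_c <- c(doc1 = "One fish, two fish.", doc2 = "Red fish, blue fish.")
# The equivalent TIF corpus data frame
corpus_df <- data.frame(doc_id = names(corpus_c),
                        text = unname(corpus_c),
                        stringsAsFactors = FALSE)
# Both forms should return the same named list of token character vectors
identical(tokenize_words(corpus_c), tokenize_words(corpus_df))
```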
## Bug fixes and performance improvements
- `tokenize_skip_ngrams` has been improved to generate unigrams and bigrams, according to the skip definition (#24).
- C++98 has replaced the C++11 code used for n-gram generation, widening the range of compilers `tokenizers` supports (@ironholds) (#26).
- `tokenize_skip_ngrams` now supports stopwords (#31).
- If tokenizers fail to generate tokens for a particular entry, they return `NA` consistently (#33).
- Keyboard interrupt checks have been added to Rcpp-backed functions to enable users to terminate them before completion (#37).
- `tokenize_words()` gains arguments to preserve or strip punctuation and numbers (#48).
- `tokenize_skip_ngrams()` and `tokenize_ngrams()` now return properly marked UTF-8 strings on Windows (@patperry) (#58).
# tokenizers 0.1.4
- Add the `tokenize_character_shingles()` tokenizer.
- Improvements to documentation.
# tokenizers 0.1.3
- Add vignette.
- Improvements to n-gram tokenizers.
# tokenizers 0.1.2
- Add stopwords for several languages.
- New stopword options to `tokenize_words()` and `tokenize_word_stems()`.
# tokenizers 0.1.1
- Fix failing test in non-UTF-8 locales.
# tokenizers 0.1.0
- Initial release with tokenizers for characters, words, word stems, sentences,
paragraphs, n-grams, skip n-grams, lines, and regular expressions.
tokenizers/data/ 0000755 0001762 0000144 00000000000 13070504253 013357 5 ustar ligges users tokenizers/data/mobydick.rda 0000644 0001762 0000144 00001366322 13070504253 015665 0 ustar ligges users [binary bzip2 payload omitted]