lexRankr/inst/doc/Analyzing_Twitter_with_LexRankr.html.asis
%\VignetteIndexEntry{Analyzing Twitter with LexRankr}
%\VignetteEngine{R.rsp::asis}
%\VignetteKeyword{twitter}
%\VignetteKeyword{lexrankr}
lexRankr/inst/doc/Analyzing_Twitter_with_LexRankr.html

Using lexRankr to find a user’s most representative tweets

Adam Spannbauer

2017-03-01

Packages Used

library(lexRankr)
library(tidyverse)
library(stringr)
library(httr)
library(jsonlite)

In this document we get tweets from Twitter using the Twitter API and then analyze them with lexRankr to find a user's most representative tweets. If you don't care about interacting with the Twitter API, you can jump ahead to the LexRank analysis.

Get user tweets

Before we can analyze tweets we'll need some tweets to analyze. We'll be using Twitter's API, and you'll need to set up an account to get the keys needed for the API. The credentials needed are: consumer key, consumer secret, token, and token secret. Below is how to set up your credentials for use in this vignette.

# set api tokens/keys/secrets as environment vars
# Sys.setenv(cons_key     = 'my_cons_key')
# Sys.setenv(cons_secret  = 'my_cons_sec')
# Sys.setenv(token        = 'my_token')
# Sys.setenv(token_secret = 'my_token_sec')

#sign oauth
auth <- httr::oauth_app("twitter", key=Sys.getenv("cons_key"), secret=Sys.getenv("cons_secret"))
sig  <- httr::sign_oauth1.0(auth, token=Sys.getenv("token"), token_secret=Sys.getenv("token_secret"))
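Before moving on, it can be worth a quick sanity check that the signed credentials actually work. A minimal sketch of such a check against the 1.1 API's verify_credentials endpoint is below; this step is optional and not part of the original workflow.

# optional: confirm the signed credentials are valid (expect status 200)
cred_check <- httr::GET("https://api.twitter.com/1.1/account/verify_credentials.json", sig)
httr::status_code(cred_check)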

Now that we have our credentials set up, let's write a function to get a user's tweets from the API. Below the function get_timeline_df is defined. The function takes a user's Twitter handle, the number of tweets to get from the API, and the credentials we just set up. The function will return a dataframe with the columns id, created_at, favorite_count, retweet_count, and text. The Twitter API returns at most 200 tweets per request, so we will loop, paging with max_id, until we have the desired number of tweets.

get_timeline_df <- function(user, n_tweets=200, oauth_sig) {
  i <- 0
  n_left <- n_tweets
  timeline_df <- NULL
    #loop until all n_tweets have been retrieved
  while (n_left > 0) {
    n_to_get <- min(200, n_left)
    i <- i+1
    #incorporate max_id in get_url (so as not to download the same 200 tweets repeatedly)
    if (i==1) {
      get_url <- paste0("https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=",
                       user,"&count=", n_to_get)
    } else {
      get_url <- paste0("https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=",
                       user,"&count=",n_to_get,"&max_id=", max_id)
    }
    #GET tweets
    response <- httr::GET(get_url, oauth_sig)
    #extract content and clean up
    response_content <- httr::content(response)
    json_content     <- jsonlite::toJSON(response_content)
    #clean out evil special chars
    json_conv <- iconv(json_content, "UTF-8", "ASCII", sub = "") %>%
      stringr::str_replace_all("\003", "") #special character (^C) not caught by above clean
    timeline_list <- jsonlite::fromJSON(json_conv)
    #extract desired fields
    fields_i_care_about <- c("id", "text", "favorite_count", "retweet_count", "created_at")
    timeline_df <- purrr::map(fields_i_care_about, ~unlist(timeline_list[[.x]])) %>% 
      purrr::set_names(fields_i_care_about) %>% 
      dplyr::as_data_frame() %>% 
      dplyr::bind_rows(timeline_df) %>% 
      dplyr::distinct()
    #store min id (oldest tweet) to set as max id for next GET
    max_id <- min(purrr::map_dbl(timeline_list$id, 1))
    #update number of tweets left
    n_left <- n_left-n_to_get
  }
  return(timeline_df)
}

We can now use our function to gather a user's tweets along with the date-time, favorite count, and retweet count of each. Let's use one of the most famous Twitter accounts as of late: @realDonaldTrump.

tweets_df <- get_timeline_df("realDonaldTrump", 600, sig) %>% 
    mutate(text = str_replace_all(text, "\n", " ")) #clean out newlines for display

tweets_df %>% 
  head(n=3) %>% 
  select(text, created_at) %>% 
  knitr::kable()
text created_at
Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY! Tue Dec 20 20:27:57 +0000 2016
especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states Tue Dec 20 13:09:18 +0000 2016
Bill Clinton stated that I called him after the election. Wrong, he called me (with a very nice congratulations). He “doesn’t know much” … Tue Dec 20 13:03:59 +0000 2016

LexRank Analysis

We now have a dataframe that contains a column of tweets. This column of tweets will be the subject of the rest of the analysis. With the data in this format, we only need to call the bind_lexrank function to apply the LexRank algorithm to the tweets. The function will add a column of lexrank scores. The higher the lexrank score, the more representative the tweet is of the full set of tweets we downloaded.

note: typically one would parse documents into sentences before applying LexRank (see ?unnest_sentences); however, we will treat each tweet as a sentence for this analysis. A sketch of the typical document-level workflow is shown below for reference.
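A minimal sketch of that workflow, assuming a hypothetical docs_df dataframe with doc_id and text columns (these names are placeholders, not objects from this analysis):

# docs_df: hypothetical dataframe with columns doc_id & text
docs_df %>% 
  unnest_sentences(sents, text) %>%                     # parse each document into sentences
  bind_lexrank(sents, doc_id, level = "sentences") %>%  # score each sentence with lexrank
  arrange(desc(lexrank))                                # most representative sentences first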

tweets_df %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @realDonaldTrump Tweets")
Most Representative @realDonaldTrump Tweets
text lexrank
MAKE AMERICA GREAT AGAIN! 0.0087551
Well, the New Year begins. We will, together, MAKE AMERICA GREAT AGAIN! 0.0085258
HAPPY PRESIDENTS DAY - MAKE AMERICA GREAT AGAIN! 0.0082361
Happy Thanksgiving to everyone. We will, together, MAKE AMERICA GREAT AGAIN! 0.0060486
Hopefully, all supporters, and those who want to MAKE AMERICA GREAT AGAIN, will go to D.C. on January 20th. It will be a GREAT SHOW! 0.0059713

Repeating the tweetRank analysis for other users

With our get_timeline_df function we can easily repeat this analysis for other users. Below we repeat the whole analysis in a single magrittr pipeline.

get_timeline_df("dog_rates", 600, sig) %>% 
  mutate(text = str_replace_all(text, "\n", " ")) %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @dog_rates Tweets")
Most Representative @dog_rates Tweets
text lexrank
@Lin_Manuel good day good dog 0.0167123
Please keep loving 0.0099864
Here we h*ckin go 0.0085708
Last day to get anything from our Valentine’s Collection by Valentine’s Day! 0.0077583
Even if I tried (which I would never), I’d last like 17 seconds 0.0073899



lexRankr/tests/0000755000176200001440000000000013443530264013206 5ustar liggesuserslexRankr/tests/testthat.R0000644000176200001440000000007413177136432015175 0ustar liggesuserslibrary(testthat) library(lexRankr) test_check("lexRankr") lexRankr/tests/testthat/0000755000176200001440000000000013443532523015046 5ustar liggesuserslexRankr/tests/testthat/test-unnest_sentences.R0000644000176200001440000000643713213603250021532 0ustar liggesuserscontext("unnest_sentences") # test output str -------------------------------------------------------- test_that("correct ouput class and str", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences(df, out, text) expect_equal(dim(test_result), c(4,3)) expect_true(is.data.frame(test_result)) expect_equal(names(test_result), c("doc_id","sent_id","out")) test_result <- unnest_sentences(df, out, text, drop=FALSE) expect_equal(dim(test_result), c(4,4)) expect_equal(names(test_result), c("doc_id","text","sent_id","out")) }) # test bad input ------------------------------------------------------- test_that("test input checking", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) expect_error(unnest_sentences(df, out, fake)) expect_error(unnest_sentences(NULL, out, text)) expect_error(unnest_sentences(df, out, text, drop = NULL)) expect_error(unnest_sentences(df, out, text, doc_id = fake)) }) # test output val ------------------------------------------------------ test_that("output value", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences(df, out, text) expected_result <- data.frame(doc_id = c(1L, 1L, 2L, 3L), sent_id = c(1L, 2L, 1L, 1L), out = c("Testing the system.", "Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) df <- data.frame(doc_id = c(1,1,3), text = c("Testing the system. 
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences(df, out, text, doc_id = doc_id) expected_result <- data.frame(doc_id = c(1L, 1L, 1L, 3L), sent_id = c(1L, 2L, 3L, 1L), out = c("Testing the system.", "Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) }) lexRankr/tests/testthat/test-lexRankFromSimil.R0000644000176200001440000000621413213603250021366 0ustar liggesuserscontext("lexRankFromSimil") # test object out str and class --------------------------------------- test_that("object out str and class", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected in my test?", "Is it working?") tokenDf <- sentenceTokenParse(testDocs)$tokens similDf <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId) testResult <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal) expect_equal(class(testResult), "data.frame") expect_equal(names(testResult), c("sentenceId", "value")) expect_true(is.character(testResult$sentenceId)) expect_true(is.numeric(testResult$value)) }) # test bad inputs --------------------------------------- test_that("bad inputs", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected in my test?", "Is it working?") tokenDf <- sentenceTokenParse(testDocs)$tokens similDf <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId) expect_error(lexRankFromSimil(NULL, similDf$sent2, similDf$similVal)) expect_error(lexRankFromSimil(c(1,2), similDf$sent2, similDf$similVal)) expect_error(lexRankFromSimil(similDf$sent1, similDf$sent2, c("a","b","c"))) expect_error(lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal, threshold = NULL)) expect_error(lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal, damping = NULL)) }) # test object out value test_that("object out value", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected in my test?", "Is it working?") tokenDf <- sentenceTokenParse(testDocs)$tokens similDf <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId) testResult <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal) testResult$value = round(testResult$value, 5) expectedResult <- data.frame(sentenceId = c("1_1", "2_1", "3_1"), value = c(0.25676, 0.48649, 0.25676), stringsAsFactors = FALSE) expect_identical(testResult, expectedResult) testResult <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal, continuous = TRUE) testResult$value = round(testResult$value, 5) expectedResult <- data.frame(sentenceId = c("1_1", "2_1", "3_1"), value = c(0.25676, 0.48649, 0.25676), stringsAsFactors = FALSE) expect_identical(testResult, expectedResult) testResult <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal, usePageRank = FALSE) testResult$value = round(testResult$value, 5) expectedResult <- data.frame(sentenceId = c("2_1", "1_1", "3_1"), value = c(2, 1, 1), stringsAsFactors = FALSE) expect_identical(testResult, expectedResult) }) lexRankr/tests/testthat/test-idfCosine.R0000644000176200001440000000266413177136432020064 0ustar liggesuserscontext("lexRankr:::idfCosineSimil") # test bad inputs --------------------------------------- test_that("bad inputs to idf cosine", { 
expect_error(lexRankr:::idfCosineSimil(NULL)) badMat <- matrix(c("a","b","c","d"), nrow=2) expect_error(lexRankr:::idfCosineSimil(badMat)) }) # test object out str and class --------------------------------------- test_that("object out str and class", { testMat <- matrix(runif(9, min = .01, max = 1), nrow=3) testResult <- lexRankr:::idfCosineSimil(testMat) expect_equal(class(testResult), "numeric") expect_equal(length(testResult), 3) }) # test object out value test_that("object out value", { testMat <- matrix(c(1,0,0,0,1,0,0,0,1), nrow=3) expect_equal(lexRankr:::idfCosineSimil(testMat), c(0,0,0)) testMat <- matrix(c(0,0,0,0,0,0,0,0,0), nrow=3) expect_equal(lexRankr:::idfCosineSimil(testMat), c(NaN,NaN,NaN)) testMat <- matrix(c(1,1,1,1,1,1,1,1,1), nrow=3) expect_equal(lexRankr:::idfCosineSimil(testMat), c(1,1,1)) testMat <- matrix(runif(9, min = .01, max = 1), nrow=3) rcppIdf <- round(lexRankr:::idfCosineSimil(testMat), 10) #pure r version comparison idfCosine <- function(x,y) { res <- sum(x*y)/(sqrt(sum(x^2))*sqrt(sum(y^2))) return(round(res, 10)) } elem1 <- idfCosine(testMat[1,], testMat[2,]) elem2 <- idfCosine(testMat[1,], testMat[3,]) elem3 <- idfCosine(testMat[2,], testMat[3,]) expect_equal(rcppIdf, c(elem1, elem2, elem3)) }) lexRankr/tests/testthat/test-sentenceTokenParse.R0000644000176200001440000000305013213610270021733 0ustar liggesuserscontext("sentenceTokenParse") # test output classes ---------------------------------------- test_that("object class and structure check", { testDocs <- c("12345", "Testing 1, 2, 3.", "Is everything working as expected Mr. Wickham?") testResult <- sentenceTokenParse(testDocs) expect_equal(class(testResult), "list") expect_equal(unique(vapply(testResult, class, character(1))), "data.frame") expect_equal(names(testResult$tokens), c("docId","sentenceId","token")) expect_true(is.numeric(testResult$tokens$docId)) expect_true(is.character(testResult$tokens$sentenceId)) expect_true(is.character(testResult$tokens$sentence)) }) # test output value ------------------------------------------- test_that("All clean options TRUE", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected Mr. 
Wickham?") testResult <- sentenceTokenParse(testDocs, docId = "create", removePunc=TRUE, removeNum=TRUE, toLower=TRUE, stemWords=TRUE, rmStopWords=TRUE) expectedResultSentences <- sentenceParse(testDocs) expectedResultTokens <- unlist(lexRankr::tokenize(testDocs)) expectedResultTokens <- expectedResultTokens[which(!is.na(expectedResultTokens))] expect_equal(testResult$sentences, expectedResultSentences) expect_equal(testResult$tokens$token, expectedResultTokens) expect_equal(class(testResult), "list") }) lexRankr/tests/testthat/test-sentenceSimil.R0000644000176200001440000001004513213603250020737 0ustar liggesuserscontext("sentenceSimil") # test object out str and class --------------------------------------- test_that("testing result str and class", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected in my test?", "Is it working?") tokenDf <- sentenceTokenParse(testDocs)$tokens testResult <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId, sentencesAsDocs = FALSE) expect_equal(class(testResult), "data.frame") expect_equal(names(testResult), c("sent1","sent2","similVal")) expect_true(is.character(testResult$sent1)) expect_true(is.character(testResult$sent2)) expect_true(is.numeric(testResult$similVal)) testResult <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId, sentencesAsDocs = TRUE) expect_equal(class(testResult), "data.frame") expect_equal(names(testResult), c("sent1","sent2","similVal")) expect_true(is.character(testResult$sent1)) expect_true(is.character(testResult$sent2)) expect_true(is.numeric(testResult$similVal)) testResult <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId, sentencesAsDocs = TRUE) expect_equal(class(testResult), "data.frame") expect_equal(names(testResult), c("sent1","sent2","similVal")) expect_true(is.character(testResult$sent1)) expect_true(is.character(testResult$sent2)) expect_true(is.numeric(testResult$similVal)) }) test_that("bad input", { expect_error(sentenceSimil(sentenceId = c("1_1"), token = c("word","word2"), docId = c(1,2))) expect_error(sentenceSimil(sentenceId = c("1_1", "2_1"), token = c(1,2), docId = c(1,2))) #was relevant when using idf calc w/o bounding at 1 # testDocs <- c("test","test") # tokenDf <- sentenceTokenParse(testDocs)$tokens # # expect_error(sentenceSimil(sentenceId = tokenDf$sentenceId, # token = tokenDf$token, # docId = tokenDf$docId)) testDocs <- c("1","2") tokenDf <- sentenceTokenParse(testDocs)$tokens expect_error(sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId)) }) # test output value --------------------------------------- test_that("output value check", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected in my test?", "Is it working?") tokenDf <- sentenceTokenParse(testDocs)$tokens testResult <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId, sentencesAsDocs = FALSE) testResult$similVal = round(testResult$similVal, 5) expectedResult <- data.frame(sent1 = c("1_1", "1_1", "2_1"), sent2 = c("2_1", "3_1", "3_1"), similVal = c(0.48624, 0, 0.48624), stringsAsFactors = FALSE) expect_equal(testResult, expectedResult) testResult <- sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, docId = tokenDf$docId, sentencesAsDocs = TRUE) testResult$similVal = round(testResult$similVal, 5) expectedResult <- data.frame(sent1 = c("1_1", "1_1", "2_1"), sent2 = c("2_1", 
"3_1", "3_1"), similVal = c(0.48624, 0, 0.48624), stringsAsFactors = FALSE) expect_equal(testResult, expectedResult) }) lexRankr/tests/testthat/test-unnest_sentences_.R0000644000176200001440000000663713213603250021673 0ustar liggesuserscontext("unnest_sentences_") # test output str -------------------------------------------------------- test_that("correct ouput class and str", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences_(df, "out", "text") expect_equal(dim(test_result), c(4,3)) expect_true(is.data.frame(test_result)) expect_equal(names(test_result), c("doc_id","sent_id","out")) test_result <- unnest_sentences_(df, "out", "text", drop=FALSE) expect_equal(dim(test_result), c(4,4)) expect_equal(names(test_result), c("doc_id","text","sent_id","out")) }) # test bad input ------------------------------------------------------- test_that("test input checking", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) expect_error(unnest_sentences_(df, "out", "fake")) expect_error(unnest_sentences_(NULL, "out", "text")) expect_error(unnest_sentences_(df, "out", "text", drop = NULL)) expect_error(unnest_sentences(df, "out", "text", doc_id = "fake")) expect_warning(unnest_sentences_(df, "out", "text", output_id=c("test","test2"))) }) # test output val ------------------------------------------------------ test_that("output value", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences_(df, "out", "text") expected_result <- data.frame(doc_id = c(1L, 1L, 2L, 3L), sent_id = c(1L, 2L, 1L, 1L), out = c("Testing the system.", "Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) df <- data.frame(doc_id = c(1,1,3), text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences_(df, "out", "text", doc_id = "doc_id") expected_result <- data.frame(doc_id = c(1L, 1L, 1L, 3L), sent_id = c(1L, 2L, 3L, 1L), out = c("Testing the system.", "Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) }) lexRankr/tests/testthat/test-tokenize.R0000644000176200001440000001163713177136432020011 0ustar liggesuserscontext("tokenize") # test tokenize output classes ---------------------------------------- test_that("All clean options TRUE", { testDocs <- c("12345", "Testing 1, 2, 3.", "Is everything working as expected Mr. 
Wickham?") testResult <- tokenize(testDocs) expect_equal(class(testResult), "list") expect_equal(unique(vapply(testResult, class, character(1))), "character") }) # test bad input ------------------------------------------------------- test_that("test input checking", { expect_error(tokenize(NULL)) expect_error(tokenize(data.frame(badInput="test"))) expect_error(tokenize("test", removePunc=NULL)) expect_error(tokenize("test", removeNum=NULL)) expect_error(tokenize("test", toLower=NULL)) expect_error(tokenize("test", stemWords=NULL)) expect_error(tokenize("test", rmStopWords=NULL)) }) # test tokenize and arg option variations ------------------------------ test_that("All clean options TRUE", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected Mr. Wickham?") testResult <- tokenize(testDocs, removePunc=TRUE, removeNum=TRUE, toLower=TRUE, stemWords=TRUE, rmStopWords=TRUE) expectedResult <- list("test", c("work","expect","mr","wickham")) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "list") }) test_that("All clean options FALSE", { testDocs <- c("Testing 1, 2, 3", "Is everything working as expected Mr. Wickham?") testResult <- tokenize(testDocs, removePunc=FALSE, removeNum=FALSE, toLower=FALSE, stemWords=FALSE, rmStopWords=FALSE) expectedResult <- list(c("Testing", "1", ",", "2", ",", "3"), c("Is", "everything", "working", "as", "expected", "Mr", ".", "Wickham", "?")) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "list") }) test_that("Single option tests: removePunc = FALSE", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected Mr. Wickham?") testResult <- tokenize(testDocs, removePunc=FALSE, removeNum=TRUE, toLower=TRUE, stemWords=TRUE, rmStopWords=TRUE) expectedResult <- list(c("test",",",",","." ), c("work","expect","mr",".","wickham","?" )) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "list") }) test_that("Single option tests: removeNum = FALSE", { testDocs <- c("Testing 1, 2, 3", "Is everything working as expected Mr. Wickham?") testResult <- tokenize(testDocs, removePunc=TRUE, removeNum=FALSE, toLower=TRUE, stemWords=TRUE, rmStopWords=TRUE) expectedResult <- list(c("test","1","2","3"), c("work","expect","mr","wickham")) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "list") }) test_that("Single option tests: toLower = FALSE", { testDocs <- c("Testing 1, 2, 3", "Is everything working as expected Mr. Wickham?") testResult <- tokenize(testDocs, removePunc=TRUE, removeNum=TRUE, toLower=FALSE, stemWords=TRUE, rmStopWords=TRUE) expectedResult <- list(c("Test"), c("work","expect","Mr","Wickham")) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "list") }) test_that("Single option tests: stemWords = FALSE", { testDocs <- c("Testing 1, 2, 3", "Is everything working as expected Mr. Wickham?") testResult <- tokenize(testDocs, removePunc=TRUE, removeNum=TRUE, toLower=TRUE, stemWords=FALSE, rmStopWords=TRUE) expectedResult <- list(c("testing"), c("working","expected","mr","wickham")) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "list") }) test_that("Single option tests: rmStopWords = FALSE", { testDocs <- c("Testing 1, 2, 3", "Is everything working as expected Mr. 
Wickham?") testResult <- tokenize(testDocs, removePunc=TRUE, removeNum=TRUE, toLower=TRUE, stemWords=TRUE, rmStopWords=FALSE) expectedResult <- list(c("test"), c("i","everyth","work","a", "expect", "mr", "wickham")) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "list") }) lexRankr/tests/testthat/test-sentenceParse.R0000644000176200001440000000324213177136432020751 0ustar liggesuserscontext("sentenceParse") # test sentence object structure----------------------------------------------- test_that("sentenceParse output class and structure check", { testDoc <- "Testing one, two, three. Is everything working as expected Mr. Wickham?" testResult <- sentenceParse(testDoc) expect_equal(class(testResult), "data.frame") expect_equal(names(testResult), c("docId", "sentenceId", "sentence")) expect_true(is.numeric(testResult$docId)) expect_true(is.character(testResult$sentenceId)) expect_true(is.character(testResult$sentence)) }) # test bad input ------------------------------------------------------- test_that("test input checking", { expect_error(sentenceParse(NULL)) expect_error(sentenceParse(data.frame(badInput="test"))) expect_error(sentenceParse("test", docId = c("fake","fake2"))) expect_error(sentenceParse(c("test","test2"), docId = "fake")) expect_error(sentenceParse(c("test","test2"), docId = NULL)) }) # test sentence output value ----------------------------------------------- test_that("Example doc parses sentences as expected", { testDoc <- "Testing one, two, three. Is everything working as expected Mr. Wickham?" testResult <- sentenceParse(testDoc) expectedResult <- data.frame(docId = c(1L, 1L), sentenceId = c("1_1", "1_2"), sentence = c("Testing one, two, three.", "Is everything working as expected Mr. Wickham?"), stringsAsFactors = FALSE) expect_equal(testResult, expectedResult) expect_equal(class(testResult), "data.frame") expect_equal })lexRankr/tests/testthat/test-lexRank.R0000644000176200001440000000302413213603250017540 0ustar liggesuserscontext("lexRank") # test object out str and class --------------------------------------- test_that("object out str and class", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected in my test?", "Is it working?") testResult <- lexRank(testDocs, Verbose = FALSE) expect_equal(class(testResult), "data.frame") expect_equal(names(testResult), c("docId","sentenceId", "sentence","value")) expect_true(is.character(testResult$sentenceId)) expect_true(is.character(testResult$sentence)) expect_true(is.numeric(testResult$value)) }) # test bad inputs --------------------------------------- test_that("bad inputs", { expect_error(lexRank(FALSE, Verbose = FALSE)) expect_error(lexRank(NULL, Verbose = FALSE)) }) # test object out value test_that("object out value", { testDocs <- c("Testing 1, 2, 3.", "Is everything working as expected in my test?", "Is it working?") testResult <- lexRank(testDocs, Verbose = FALSE) testResult$value = round(testResult$value, 5) expectedResult <- data.frame(docId = c(2L, 1L, 3L), sentenceId = c("2_1", "1_1", "3_1"), sentence = c("Is everything working as expected in my test?", "Testing 1, 2, 3.", "Is it working?"), value = c(0.48649, 0.25676, 0.25676), stringsAsFactors = FALSE) expect_identical(testResult, expectedResult) }) lexRankr/tests/testthat/test-bind_lexrank_.R0000644000176200001440000002175513213611117020747 0ustar liggesuserscontext("bind_lexrank_") # test output str -------------------------------------------------------- test_that("correct ouput class and str", { df <- 
data.frame(doc_id = 1:3, text = c("Testing the system. Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences(df, sents, text) test_result <- bind_lexrank_(test_result, "sents", "doc_id", level = 'sentences') expect_equal(dim(test_result), c(4,4)) expect_true(is.data.frame(test_result)) expect_equal(names(test_result), c("doc_id","sent_id","sents","lexrank")) test_result <- unnest_sentences(df, sents, text, drop=FALSE) test_result <- bind_lexrank_(test_result, "sents", "doc_id", level = 'sentences') expect_equal(dim(test_result), c(4,5)) expect_equal(names(test_result), c("doc_id","text","sent_id","sents","lexrank")) df <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), stringsAsFactors = FALSE) test_result <- bind_lexrank_(df, "tokens", "doc_id", "sent_id", "tokens") expect_equal(dim(test_result), c(19,5)) expect_equal(names(test_result), c("doc_id","sent_id","sents","tokens","lexrank")) }) # test bad input ------------------------------------------------------- test_that("test input checking", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. 
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) df <- unnest_sentences(df, sents, text) expect_error(bind_lexrank_(df, "sents", "fake")) expect_error(bind_lexrank_(NULL, "sents", "doc_id")) expect_error(bind_lexrank_(df, "sents", "doc_id", level="fake")) # expect_warning(bind_lexrank_(df, "sents", "doc_id", level=c("sentences","tokens"))) df <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), stringsAsFactors = FALSE) expect_error(bind_lexrank_(df, "tokens", "doc_id", "fake", level="tokens")) expect_error(bind_lexrank_(df, "tokens", "doc_id", level="tokens")) # expect_warning(bind_lexrank_(df, "tokens", "doc_id", "sent_id", level=c("tokens","sentences"))) }) # test output val ------------------------------------------------------ test_that("output value", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. 
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) df <- unnest_sentences(df, sents, text) test_result <- bind_lexrank_(df, "sents", "doc_id", level="sentences") expected_result <- data.frame(doc_id = c(1L, 1L, 2L, 3L), sent_id = c(1L, 2L, 1L, 1L), sents = c("Testing the system.", "Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), lexrank = c(0.5, NA, 0.5, NA), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) df <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), stringsAsFactors = FALSE) test_result <- bind_lexrank_(df, "tokens", "doc_id", "sent_id", level="sentences") test_result$lexrank <- round(test_result$lexrank, 5) expected_result <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), lexrank = c(0.16667, NA, 0.16667, NA, NA, NA, NA, 0.16667, 0.16667, NA, NA, 0.16667, NA, 0.16667, NA, NA, NA, NA, NA), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) }) lexRankr/tests/testthat/test-bind_lexrank.R0000644000176200001440000002164713213610767020622 0ustar liggesuserscontext("bind_lexrank") # test output str -------------------------------------------------------- test_that("correct ouput class and str", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. 
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) test_result <- unnest_sentences(df, sents, text) test_result <- bind_lexrank(test_result, sents, doc_id, level = 'sentences') expect_equal(dim(test_result), c(4,4)) expect_true(is.data.frame(test_result)) expect_equal(names(test_result), c("doc_id","sent_id","sents","lexrank")) test_result <- unnest_sentences(df, sents, text, drop=FALSE) test_result <- bind_lexrank(test_result, sents,doc_id, level = 'sentences') expect_equal(dim(test_result), c(4,5)) expect_equal(names(test_result), c("doc_id","text","sent_id","sents","lexrank")) df <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), stringsAsFactors = FALSE) test_result <- bind_lexrank(df, tokens, doc_id, sent_id, "tokens") expect_equal(dim(test_result), c(19,5)) expect_equal(names(test_result), c("doc_id","sent_id","sents","tokens","lexrank")) }) # test bad input ------------------------------------------------------- test_that("test input checking", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. 
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) df <- unnest_sentences(df, sents, text) expect_error(bind_lexrank(df, sents, fake)) expect_error(bind_lexrank(NULL, sents, doc_id)) expect_error(bind_lexrank(df, sents, doc_id, level="fake")) # expect_warning(bind_lexrank(df, sents, doc_id, level=c("sentences","tokens"))) df <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), stringsAsFactors = FALSE) expect_error(bind_lexrank(df, tokens, doc_id, fake, level="tokens")) expect_error(bind_lexrank(df, tokens, doc_id, level="tokens")) # expect_warning(bind_lexrank(df, tokens, doc_id, sent_id, level=c("tokens","sentences"))) }) # test output val ------------------------------------------------------ test_that("output value", { df <- data.frame(doc_id = 1:3, text = c("Testing the system. 
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) df <- unnest_sentences(df, sents, text) test_result <- bind_lexrank(df, sents, doc_id, level="sentences") expected_result <- data.frame(doc_id = c(1L, 1L, 2L, 3L), sent_id = c(1L, 2L, 1L, 1L), sents = c("Testing the system.", "Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), lexrank = c(0.5, NA, 0.5, NA), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) df <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), stringsAsFactors = FALSE) test_result <- bind_lexrank(df, tokens, doc_id, sent_id, level="sentences") test_result$lexrank <- round(test_result$lexrank, 5) expected_result <- data.frame(doc_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), sent_id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), sents = c("Testing the system.", "Testing the system.", "Testing the system.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "Second sentence for you.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "System testing the tidy documents df.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked.", "Documents will be parsed and lexranked."), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), lexrank = c(0.16667, NA, 0.16667, NA, NA, NA, NA, 0.16667, 0.16667, NA, NA, 0.16667, NA, 0.16667, NA, NA, NA, NA, NA), stringsAsFactors = FALSE) expect_equal(test_result, expected_result) }) lexRankr/src/0000755000176200001440000000000013443530264012633 5ustar liggesuserslexRankr/src/idfCosineSimil.cpp0000644000176200001440000000147313443530264016245 0ustar liggesusers#include using namespace Rcpp; double idfCosineSimilVector(NumericVector x, NumericVector y) { int n=x.size(); double numerator=0; double denomenatorX=0; double denomenatorY=0; double result; for (int i = 0; i #include #include // for NULL #include /* FIXME: Check these 
declarations against the C/Fortran source code. */ /* .Call calls */ extern SEXP _lexRankr_idfCosineSimil(SEXP); static const R_CallMethodDef CallEntries[] = { {"_lexRankr_idfCosineSimil", (DL_FUNC) &_lexRankr_idfCosineSimil, 1}, {NULL, NULL, 0} }; void R_init_lexRankr(DllInfo *dll) { R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); R_useDynamicSymbols(dll, FALSE); } lexRankr/src/RcppExports.cpp0000644000176200001440000000104013443530264015623 0ustar liggesusers// Generated by using Rcpp::compileAttributes() -> do not edit by hand // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 #include using namespace Rcpp; // idfCosineSimil NumericVector idfCosineSimil(NumericMatrix mat); RcppExport SEXP _lexRankr_idfCosineSimil(SEXP matSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< NumericMatrix >::type mat(matSEXP); rcpp_result_gen = Rcpp::wrap(idfCosineSimil(mat)); return rcpp_result_gen; END_RCPP } lexRankr/NAMESPACE0000644000176200001440000000055413443526447013277 0ustar liggesusers# Generated by roxygen2: do not edit by hand export(bind_lexrank) export(bind_lexrank_) export(lexRank) export(lexRankFromSimil) export(sentenceParse) export(sentenceSimil) export(sentenceTokenParse) export(tokenize) export(unnest_sentences) export(unnest_sentences_) importFrom(Rcpp,sourceCpp) importFrom(stats,xtabs) importFrom(utils,combn) useDynLib(lexRankr) lexRankr/NEWS.md0000644000176200001440000000341013401567171013141 0ustar liggesusers # lexRankr 0.5.2 * fix damping bug where damping parameter wasn't passed to `igraph::pagerank` # lexRankr 0.5.1 * changed `smart_stopwords` to be internal data so that package doesnt need to be explicitly loaded with `library` to be able to parse # lexRankr 0.5.0 * bug fix in sentence parsing for parsing exclamatory sentences * converted idf calculation from `idf(d, t) = log( n / df(d, t) )` to `idf(d, t) = log( n / df(d, t) ) + 1` to avoid zeroing out common word tfidf values * removed dplyr, tidyr, stringr, magrittr, & tm as dependencies * created option to bypass assumption that each row/vector-element are different documents in `lexRank` and `unnest_sentences` * various bug fixes in token & sentence parsing # lexRankr 0.4.1 * added bug report url: (https://github.com/AdamSpannbauer/lexRankr/issues/) * formatting updates to README.md # lexRankr 0.4.0 * added functions `unnest_sentences` and `unnest_sentences_` to parse sentences in a dataframe following tidy data principles * added functions `bind_lexrank` and `bind_lexrank_` to calculate lexrank scores for sentences in a dataframe following tidy data principles (`unnest_sentences` & `bind_lexrank` can be used on a df in a magrittr pipeline) * added vignette for using lexrank to analyze tweets # lexRankr 0.3.0 * sentence similarity from `sentenceSimil` now calculated using Rcpp. Improves speed by ~25%-30% over old implementation using `proxy` package # lexRankr 0.2.0 * Added logic to avoid naming conflicts in proxy::pr_DB in `sentenceSimil` (#1, @AdamSpannbauer) * Added check and error for cases where no sentences above threshold in `lexRankFromSimil` (#2, @AdamSpannbauer) * `tokenize` now has stricter punctuation removal. 
Removes all non-alphanumeric characters as opposed to removing `[:punct:]`
lexRankr/data/smart_stopwords.rda
lexRankr/R/sentenceTokenParse.R
#' Parse text into sentences and tokens
#' @description Parse a character vector of documents into both sentences and a clean vector of tokens. The resulting output includes IDs for document and sentence for use in other \code{lexRank} functions.
#' @param text A character vector of documents to be parsed into sentences and tokenized.
#' @param docId A character vector of document Ids the same length as \code{text}. If \code{docId=="create"} document Ids will be created.
#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text} while tokenizing. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}.
#' @param removeNum \code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from \code{text} while tokenizing. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.
#' @param toLower \code{TRUE} or \code{FALSE} indicating whether or not to coerce all of \code{text} to lowercase while tokenizing. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.
#' @param stemWords \code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the outputted tokens will be stemmed using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.
#' @param rmStopWords \code{TRUE}, \code{FALSE}, or character vector of stopwords to remove from tokens. If \code{TRUE}, words in \code{lexRankr::smart_stopwords} will be removed prior to stemming. If \code{FALSE}, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to \code{TRUE}.
#' @return A list of dataframes. The first element of the list returned is the \code{sentences} dataframe; this dataframe has columns \code{docId}, \code{sentenceId}, & \code{sentence} (the actual text of the sentence). The second element of the list returned is the \code{tokens} dataframe; this dataframe has columns \code{docId}, \code{sentenceId}, & \code{token} (the actual text of the token).
#' @examples #' sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."), #' docId=c("d1","d2")) #' @export sentenceTokenParse <- function(text, docId = "create", removePunc=TRUE, removeNum=TRUE, toLower=TRUE, stemWords=TRUE, rmStopWords=TRUE){ sentenceDf <- sentenceParse(text, docId=docId) tokenDfList <- lapply(seq_along(sentenceDf$sentence), function(i) { sentVec <- sentenceDf$sentence[i] tokenList <- tokenize(text = sentVec, removePunc = removePunc, removeNum = removeNum, toLower = toLower, stemWords = stemWords, rmStopWords=rmStopWords) subTokenDfList <- lapply(seq_along(tokenList), function(j) { data.frame(docId=sentenceDf$docId[i], sentenceId=sentenceDf$sentenceId[i], token=tokenList[[j]], stringsAsFactors = FALSE) }) do.call('rbind', subTokenDfList) }) tokenDf <- do.call('rbind', tokenDfList) tokenDf <- tokenDf[!is.na(tokenDf$token),] class(tokenDf) <- "data.frame" list(sentences=sentenceDf, tokens=tokenDf) } lexRankr/R/bind_lexrank.R0000644000176200001440000001460213213611366015031 0ustar liggesusers#' Bind lexrank scores to a dataframe of text #' @description Bind lexrank scores to a dataframe of sentences or to a dataframe of tokens with sentence ids #' @param tbl dataframe containing column of sentences to be lexranked #' @param text name of column containing sentences or tokens to be lexranked #' @param doc_id name of column containing document ids corresponding to \code{text} #' @param sent_id Only needed if \code{level} is "tokens". name of column containing sentence ids corresponding to \code{text} #' @param level the parsed level of the text column to be lexranked. i.e. is \code{text} a column of "sentences" or "tokens"? The "tokens" level is provided to allow users to implement custom tokenization. Note: even if the input \code{level} is "tokens" lexrank scores are assigned at the sentence level. #' @param threshold The minimum simililarity value a sentence pair must have to be represented in the graph where lexRank is calculated. #' @param usePageRank \code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentences unweighted centrality will be used as the rank. Defaults to \code{TRUE}. #' @param damping The damping factor to be passed to page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}. #' @param continuous \code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to \code{FALSE}. #' @param ... tokenizing options to be passed to lexRankr::tokenize. Ignored if \code{level} is "sentences" #' @return A dataframe with an additional column of lexrank scores (column is given name lexrank) #' @examples #' #' df <- data.frame(doc_id = 1:3, #' text = c("Testing the system. 
Second sentence for you.", #' "System testing the tidy documents df.", #' "Documents will be parsed and lexranked."), #' stringsAsFactors = FALSE) #' #' \dontrun{ #' library(magrittr) #' #' df %>% #' unnest_sentences(sents, text) %>% #' bind_lexrank(sents, doc_id, level = "sentences") #' #' df %>% #' unnest_sentences(sents, text) %>% #' bind_lexrank_("sents", "doc_id", level = "sentences") #' #' df <- data.frame(doc_id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, #' 2, 2, 2, 3, 3, 3, 3, 3, 3), #' sent_id = c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1, #' 1, 1, 1, 1, 1, 1, 1, 1, 1), #' tokens = c("testing", "the", "system", "second", #' "sentence", "for", "you", "system", #' "testing", "the", "tidy", "documents", #' "df", "documents", "will", "be", "parsed", #' "and", "lexranked"), #' stringsAsFactors = FALSE) #' #' df %>% #' bind_lexrank(tokens, doc_id, sent_id, level = 'tokens') #' } #' @export bind_lexrank_ <- function(tbl, text, doc_id, sent_id=NULL, level=c("sentences", "tokens"), threshold=.2, usePageRank=TRUE, damping=0.85, continuous=FALSE, ...) { if(!is.data.frame(tbl)) stop("tbl must be a dataframe") if(!(text %in% names(tbl))) stop("text column not found in tbl") if(!(doc_id %in% names(tbl))) stop("doc_id column not found in tbl") if(!is.character(level)) stop("level must be character") if(length(level) > 1) { level = level[1] } if(!(level %in% c("sentences", "tokens"))) stop("invalid value of level; accepted values for level are 'sentences' and 'tokens'") if(level == "tokens") { if(is.null(sent_id)) stop("sent_id must be provided when level is 'tokens'") if(!(sent_id %in% names(tbl))) stop("sent_id column not found in tbl") sent_ids <- tbl[[sent_id]] } else { sent_ids <- 1:nrow(tbl) } tbl_class <- class(tbl) doc_id_class <- class(tbl[[doc_id]]) uuid_kinda <- paste0(c("a",sample(c(letters[1:6],0:9),30,replace=TRUE)), collapse = "") uuid_sep <- paste0("__", uuid_kinda,"__") doc_sent_ids <- paste0(tbl[[doc_id]], uuid_sep, sent_ids) if(level=="sentences") { sent_id <- uuid_kinda tokenDfList <- lapply(seq_along(tbl[[text]]), function(i) { sentVec <- tbl[[text]][i] tokenList <- tokenize(text = sentVec, ...) 
subTokenDfList <- lapply(seq_along(tokenList), function(j) { data.frame(docId=tbl[[doc_id]][i], sentenceId=doc_sent_ids[i], token=tokenList[[j]], stringsAsFactors = FALSE) }) do.call('rbind', subTokenDfList) }) tokenDf <- do.call('rbind', tokenDfList) tokenDf <- tokenDf[!is.na(tokenDf$token),] } else { tokenDf <- data.frame(docId=tbl[[doc_id]], sentenceId=doc_sent_ids, token=tbl[[text]], stringsAsFactors = FALSE) } similDf <- sentenceSimil(tokenDf$sentenceId, tokenDf$token, tokenDf$docId) topSentIdsDf <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal, threshold=threshold, n=Inf, returnTies=TRUE, usePageRank=usePageRank, damping=damping, continuous=continuous) lex_lookup <- do.call('rbind', strsplit(topSentIdsDf$sentenceId, uuid_sep, fixed=TRUE)) lex_lookup <- as.data.frame(lex_lookup) names(lex_lookup) <- c(doc_id, sent_id) class(lex_lookup[[doc_id]]) <- doc_id_class lex_lookup$lexrank <- topSentIdsDf$value if(level=="tokens") { class(lex_lookup[[sent_id]]) <- class(tbl[[sent_id]]) tbl_out <- merge(tbl, lex_lookup, all.x=TRUE, by=c(doc_id, sent_id)) } else { tbl[[uuid_kinda]] <- as.character(sent_ids) tbl_out <- merge(tbl, lex_lookup, all.x=TRUE, by=c(doc_id, uuid_kinda)) tbl_out <- tbl_out[order(as.numeric(tbl_out[[uuid_kinda]])),] tbl_out[[uuid_kinda]] <- NULL } rownames(tbl_out) <- NULL class(tbl_out) <- tbl_class tbl_out } #' @rdname bind_lexrank_ #' @export bind_lexrank <- function(tbl, text, doc_id, sent_id=NULL, level=c("sentences", "tokens"), threshold=.2, usePageRank=TRUE, damping=0.85, continuous=FALSE, ...) { text_str <- as.character(substitute(text)) doc_id_str <- as.character(substitute(doc_id)) sent_id_str <- substitute(sent_id) if (!is.null(sent_id_str)) sent_id_str <- as.character(sent_id_str) bind_lexrank_(tbl, text_str, doc_id_str, sent_id=sent_id_str, level=level, threshold=threshold, usePageRank=usePageRank, damping=damping, continuous=continuous, ...) } lexRankr/R/tokenize.R0000644000176200001440000000735113401570726014227 0ustar liggesusersutils::globalVariables(c("smart_stopwords")) #' Tokenize a character vector #' Parse the elements of a character vector into a list of cleaned tokens. #' @param text The character vector to be tokenized #' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text}. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}. #' @param removeNum \code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from \code{text}. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}. #' @param toLower \code{TRUE} or \code{FALSE} indicating whether or not to coerce all of \code{text} to lowercase. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}. #' @param stemWords \code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the outputted tokens will be tokenized using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}. #' @param rmStopWords \code{TRUE}, \code{FALSE}, or character vector of stopwords to remove. If \code{TRUE}, words in \code{lexRankr::smart_stopwords} will be removed prior to stemming. If \code{FALSE}, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to \code{TRUE}. #' @examples #' tokenize("Mr. Feeny said the test would be on Sat. At least I'm 99.9% sure that's what he said.") #' tokenize("Bill is trying to earn a Ph.D. 
in his field.", rmStopWords=FALSE) #' @export tokenize <- function(text, removePunc=TRUE, removeNum=TRUE, toLower=TRUE, stemWords=TRUE, rmStopWords=TRUE){ if(!is.character(text)) stop("text must be character") if(length(text) < 1) stop("text must be at least length 1") if(!is.logical(removePunc)) stop("removePunc must be logical") if(length(removePunc) != 1) stop("removePunc must be length 1") if(!is.logical(removeNum)) stop("removeNum must be logical") if(length(removeNum) != 1) stop("removeNum must be length 1") if(!is.logical(toLower)) stop("toLower must be logical") if(length(toLower) != 1) stop("toLower must be length 1") if(!is.logical(stemWords)) stop("stemWords must be logical") if(length(stemWords) != 1) stop("stemWords must be length 1") if(!is.logical(rmStopWords) & !is.character(rmStopWords)) stop("rmStopWords must be a logical or a character vector") if(is.character(rmStopWords)) { rmStopWordFlag <- TRUE stopwords <- rmStopWords } else if(is.logical(rmStopWords)) { if(length(rmStopWords) != 1) stop("rmStopWords must be length 1 if passed as a logical") if(rmStopWords) { rmStopWordFlag <- TRUE stopwords <- smart_stopwords } else { rmStopWordFlag <- FALSE } } if (removePunc) text <- gsub(x=text,pattern="[^[:alnum:] ]",replacement="") if (removeNum) text <- gsub(x=text,pattern="([[:digit:]])",replacement="") if (toLower) text <- tolower(text) text <- gsub(x=text, pattern="([^[:alnum:] ])",replacement=" \\1 ") text <- trimws(gsub(x=text, pattern="\\s+",replacement=" ")) text <- strsplit(x=text, split=" ", fixed=TRUE) if(rmStopWordFlag) text <- lapply(text, function(tokens) { checkTokens <- tolower(tokens) if (!removePunc) { checkTokens <- gsub(x=checkTokens,pattern="[^[:alnum:] ]",replacement="") } nonStopTok <- tokens[which(!checkTokens %in% stopwords)] if(length(nonStopTok) == 0) NA_character_ else nonStopTok }) if(stemWords) { text <- lapply(text, function(w) { w_na = which(is.na(w)) out = SnowballC::wordStem(w) out[w_na] = NA out }) } tokenList <- lapply(text, function(tokens) { goodTok <- tokens[which(trimws(tokens) != "")] if(length(goodTok) == 0) NA_character_ else goodTok }) tokenList } lexRankr/R/lexRankFromSimil.R0000644000176200001440000001231613401570574015623 0ustar liggesusers#' Compute LexRanks from pairwise sentence similarities #' @description Compute LexRanks from sentence pair similarities using the page rank algorithm or degree centrality the methods used to compute lexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization." #' @param s1 A character vector of sentence IDs corresponding to the \code{s2} and \code{simil} arguments #' @param s2 A character vector of sentence IDs corresponding to the \code{s1} and \code{simil} arguments #' @param simil A numeric vector of similarity values that represents the similarity between the sentences represented by the IDs in \code{s1} and \code{s2}. #' @param threshold The minimum simil value a sentence pair must have to be represented in the graph where lexRank is calculated. #' @param n The number of sentences to return as the extractive summary. The function will return the top \code{n} lexRanked sentences. See \code{returnTies} for handling ties in lexRank. #' @param returnTies \code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentence with a top 3 score. 
If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}. #' @param usePageRank \code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentences unweighted centrality will be used as the rank. Defaults to \code{TRUE}. #' @param damping The damping factor to be passed to page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}. #' @param continuous \code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to \code{FALSE}. #' @return A 2 column dataframe with columns \code{sentenceId} and \code{value}. \code{sentenceId} contains the ids of the top \code{n} sentences in descending order by \code{value}. \code{value} contains page rank score (if \code{usePageRank==TRUE}) or degree centrality (if \code{usePageRank==FALSE}). #' @references \url{http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html} #' @examples #' lexRankFromSimil(s1=c("d1_1","d1_1","d1_2"), s2=c("d1_2","d2_1","d2_1"), simil=c(.01,.03,.5)) #' @export lexRankFromSimil <- function(s1, s2, simil, threshold=.2, n=3, returnTies=TRUE, usePageRank=TRUE, damping=0.85, continuous=FALSE) { if(!is.logical(returnTies)) stop("returnTies must be logical") if(length(returnTies) != 1) stop("returnTies must be length 1") if(!is.logical(usePageRank)) stop("usePageRank must be logical") if(length(usePageRank) != 1) stop("usePageRank must be length 1") if(!is.logical(continuous)) stop("continuous must be logical") if(length(continuous) != 1) stop("continuous must be length 1") if(!is.numeric(simil)) stop("simil must be numeric") if(!is.numeric(n)) stop("n must be numeric") if(length(n) != 1) stop("n must be length 1") if (length(s1) != length(s2) | length(s1) != length(simil)) stop("s1, s2, & simil must all be the same length") if (sum(simil) == 0) stop("all simil values are zero") if (sum(simil > threshold) == 0 & !continuous) stop("all simil values are below threshold; try lowering threshold or setting continuous to TRUE if you want to retry lexRanking this input data") s1 <- as.character(s1) s2 <- as.character(s2) if(returnTies) tieMethod <- "min" else if(!returnTies) tieMethod <- "first" edges <- data.frame(s1=s1, s2=s2, weight=simil, stringsAsFactors = FALSE) if(!continuous | !usePageRank) { if(!is.numeric(threshold)) stop("threshold must be numeric") if(length(threshold) != 1) stop("threshold must be length 1") edges <- edges[edges$weight > threshold, c("s1","s2")] } if (usePageRank) { if(!is.numeric(damping)) stop("damping must be numeric") if(length(damping) != 1) stop("damping must be length 1") sentGraph <- igraph::graph_from_data_frame(edges, directed = FALSE) sentRank <- igraph::page_rank(sentGraph, directed=FALSE, damping=damping)$vector sentRanksRanked <- rank(1/sentRank, ties.method = tieMethod) topCentral <- sentRank[which(sentRanksRanked <= n)] centralDf <- data.frame(sentenceId=names(topCentral), value=topCentral,stringsAsFactors = FALSE) rownames(centralDf) <- NULL } else if(!usePageRank){ centralDf = data.frame(sentenceId = c(edges$s1, edges$s2), stringsAsFactors = FALSE) centralDfList = split(centralDf, centralDf$sentenceId) centralDfList = lapply(centralDfList, function(dfi) { dfi[['degree']] = nrow(dfi) unique(dfi) }) centralDf = do.call('rbind', centralDfList) centralDf = 
centralDf[order(-centralDf$degree),]
    centralDf[['degRank']] = rank(1/centralDf$degree, ties.method = tieMethod)
    centralDf = centralDf[centralDf$degRank <= n, c("sentenceId", "degree")]
    names(centralDf) = c("sentenceId", "value")
    class(centralDf) <- "data.frame"
    rownames(centralDf) <- NULL
  }

  return(centralDf)
}
lexRankr/R/sysdata.rda0000644000176200001440000000406613215550475014416 0ustar liggesusers
lexRankr/R/data.R0000644000176200001440000000051113213606276013300 0ustar liggesusers#' SMART English Stopwords
#'
#' English stopwords from the SMART information retrieval system (as documented in Appendix 11 of \url{http://jmlr.csail.mit.edu/papers/volume5/lewis04a/})
#'
#' @format a character vector with 571 elements
#'
#' @source \url{http://jmlr.csail.mit.edu/papers/volume5/lewis04a/}
"smart_stopwords"
lexRankr/R/RcppExports.R0000644000176200001440000000034513213316732014660 0ustar liggesusers# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

idfCosineSimil <- function(mat) {
    .Call('_lexRankr_idfCosineSimil', PACKAGE = 'lexRankr', mat)
}
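# --- Illustrative sketch (not part of the package's source files): a minimal
# end-to-end run of the lower-level API defined above -- tokenize() feeding
# sentenceSimil() feeding lexRankFromSimil(). The toy strings and object names
# (doc, toks, tok_df, simil) are invented for illustration, and the block is
# wrapped in `if (FALSE)` so it is never executed.
if (FALSE) {
  doc  <- c("Water is wet.", "Fire is hot.", "Wet water puts out hot fire.")
  toks <- tokenize(doc)  # list of cleaned, stemmed, stopword-free token vectors

  # one row per token; each toy document is treated here as a single sentence
  tok_df <- data.frame(doc_id  = rep(seq_along(toks), lengths(toks)),
                       sent_id = rep(seq_along(toks), lengths(toks)),
                       token   = unlist(toks),
                       stringsAsFactors = FALSE)

  simil <- sentenceSimil(sentenceId = tok_df$sent_id,
                         token      = tok_df$token,
                         docId      = tok_df$doc_id)

  # top 2 sentences by lexRank score
  lexRankFromSimil(s1 = simil$sent1, s2 = simil$sent2, simil = simil$similVal,
                   threshold = 0.1, n = 2)
}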
lexRankr/R/lexRank.R0000644000176200001440000001307313401570603013773 0ustar liggesusers#' Extractive text summarization with LexRank
#' @description Compute LexRanks from a vector of documents using the page rank algorithm or degree centrality. The methods used to compute lexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."
#' @param text A character vector of documents to be cleaned and processed by the LexRank algorithm
#' @param docId A vector of document IDs with length equal to the length of \code{text}. If \code{docId == "create"} then doc IDs will be created as an index from 1 to \code{n}, where \code{n} is the length of \code{text}.
#' @param threshold The minimum similarity value a sentence pair must have to be represented in the graph where lexRank is calculated.
#' @param n The number of sentences to return as the extractive summary. The function will return the top \code{n} lexRanked sentences. See \code{returnTies} for handling ties in lexRank.
#' @param returnTies \code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentence with a top \code{n} score. If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}.
#' @param usePageRank \code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentence's unweighted centrality will be used as the rank. Defaults to \code{TRUE}.
#' @param damping The damping factor to be passed to the page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}.
#' @param continuous \code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to \code{FALSE}.
#' @param sentencesAsDocs \code{TRUE} or \code{FALSE}, indicating whether or not to treat sentences as documents when calculating tfidf scores for similarity. If \code{TRUE}, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization).
#' @param removePunc \code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from text while tokenizing. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}.
#' @param removeNum \code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from text while tokenizing. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.
#' @param toLower \code{TRUE} or \code{FALSE} indicating whether or not to coerce all of text to lowercase while tokenizing. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.
#' @param stemWords \code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the resulting tokens will be stemmed using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.
#' @param rmStopWords \code{TRUE}, \code{FALSE}, or character vector of stopwords to remove from tokens. If \code{TRUE}, words in \code{lexRankr::smart_stopwords} will be removed prior to stemming. If \code{FALSE}, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to \code{TRUE}.
#' @param Verbose \code{TRUE} or \code{FALSE} indicating whether or not to \code{cat} progress messages to the console while running. Defaults to \code{TRUE}.
#' @return A 2 column dataframe with columns \code{sentenceId} and \code{value}. \code{sentenceId} contains the ids of the top \code{n} sentences in descending order by \code{value}. \code{value} contains the page rank score (if \code{usePageRank==TRUE}) or degree centrality (if \code{usePageRank==FALSE}).
#' @references \url{http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html}
#' @examples
#' lexRank(c("This is a test.","Tests are fun.",
#'           "Do you think the exam will be hard?","Is an exam the same as a test?",
#'           "How many questions are going to be on the exam?"))
#' @export
lexRank <- function(text, docId = "create", threshold=.2, n=3,
                    returnTies=TRUE, usePageRank=TRUE, damping=0.85, continuous=FALSE,
                    sentencesAsDocs=FALSE, removePunc=TRUE, removeNum=TRUE, toLower=TRUE,
                    stemWords=TRUE, rmStopWords=TRUE, Verbose=TRUE){
  if(!is.logical(Verbose)) stop("Verbose must be logical")
  if(length(Verbose) != 1) stop("Verbose must be length 1")

  if(Verbose) cat("Parsing text into sentences and tokens...")
  sentTokList <- sentenceTokenParse(text=text, docId = docId,
                                    removePunc=removePunc, removeNum=removeNum,
                                    toLower=toLower, stemWords=stemWords,
                                    rmStopWords=rmStopWords)
  if(Verbose) cat("DONE\n")

  sentDf  <- sentTokList$sentences
  tokenDf <- sentTokList$tokens

  if(Verbose) cat("Calculating pairwise sentence similarities...")
  similDf <- sentenceSimil(sentenceId=tokenDf$sentenceId, token=tokenDf$token,
                           docId=tokenDf$docId, sentencesAsDocs=sentencesAsDocs)
  if(Verbose) cat("DONE\n")

  if(Verbose) cat("Applying LexRank...")
  topNSents <- lexRankFromSimil(s1=similDf$sent1, s2=similDf$sent2, simil=similDf$similVal,
                                threshold=threshold, n=n, returnTies=returnTies,
                                usePageRank=usePageRank, damping=damping, continuous=continuous)
  if(Verbose) cat("DONE\nFormatting Output...")

  returnDf <- merge(sentDf, topNSents, by="sentenceId")
  returnDf <- returnDf[order(-returnDf$value), c("docId", "sentenceId", "sentence", "value")]
  rownames(returnDf) = NULL
  if(Verbose) cat("DONE\n")

  return(returnDf)
}
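# --- Illustrative sketch (not part of the package's source files): a minimal
# call of the high-level lexRank() wrapper defined above. The input strings are
# invented for illustration, `continuous = TRUE` is used so this toy example
# does not depend on the default similarity threshold, and the block is wrapped
# in `if (FALSE)` so it is never executed.
if (FALSE) {
  docs <- c("The cat sat on the mat. The cat then yawned at the dog.",
            "A dog chased the cat across the yard and onto the mat.")

  # top 2 sentences across both documents, without progress messages
  lexRank(docs, n = 2, continuous = TRUE, Verbose = FALSE)
}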
Second sentence for you.", #' "System testing the tidy documents df.", #' "Documents will be parsed and lexranked."), #' stringsAsFactors=FALSE) #' #' unnest_sentences(df, sents, text) #' unnest_sentences_(df, "sents", "text") #' #' \dontrun{ #' library(magrittr) #' #' df %>% #' unnest_sentences(sents, text) #' } #' @export unnest_sentences_ <- function(tbl, output, input, doc_id=NULL, output_id="sent_id", drop=TRUE) { if(!is.data.frame(tbl)) stop("tbl must be a dataframe") if(!(input %in% names(tbl))) stop("input column not found in tbl") if(!is.character(tbl[[input]])) stop("input column must be character") if(length(output_id) > 1) { warning("only first element of output_id will be used") output_id <- output_id[1] } if(!is.logical(drop)) stop("drop must be logical") if(!is.null(doc_id)) { if(!(doc_id %in% names(tbl))) stop("doc_id column not found in tbl") } text <- tbl[[input]] parsed_sents <- sentence_parser(text) if (drop) { tbl[[input]] <- NULL } tbl_out_list <- lapply(seq_along(parsed_sents), function(i) { row_i = tbl[i,,drop=FALSE] parsed_sent_rows_i = data.frame(sent_id = seq_along(parsed_sents[[i]]), sents = parsed_sents[[i]], stringsAsFactors = FALSE) names(parsed_sent_rows_i) = c(output_id, output) out = suppressWarnings(cbind(row_i, parsed_sent_rows_i)) names(out)[seq_along(row_i)] = names(row_i) out }) out_tbl = do.call('rbind', tbl_out_list) if(!is.null(doc_id)) { out_tbl_list = split(out_tbl, out_tbl[[doc_id]]) out_tbl_list = lapply(out_tbl_list, function(dfi) { dfi[[output_id]] = seq_along(dfi[[output_id]]) dfi }) out_tbl = do.call('rbind', out_tbl_list) } rownames(out_tbl) = NULL return(out_tbl) } #' @rdname unnest_sentences_ #' @export unnest_sentences <- function(tbl, output, input, doc_id=NULL, output_id='sent_id', drop=TRUE) { output_str <- as.character(substitute(output)) input_str <- as.character(substitute(input)) out_id_str <- as.character(substitute(output_id)) doc_id <- as.character(substitute(doc_id)) if (length(doc_id) == 0) doc_id = NULL unnest_sentences_(tbl=tbl, output = output_str, input = input_str, doc_id = doc_id, output_id = out_id_str, drop = drop) } lexRankr/R/sentenceSimil.R0000644000176200001440000001007713401570312015167 0ustar liggesusers#' @useDynLib lexRankr #' @importFrom Rcpp sourceCpp NULL #' Compute distance between sentences #' @description Compute distance between sentences using modified idf cosine distance from "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization". Output can be used as input to \code{\link{lexRankFromSimil}}. #' @param sentenceId A character vector of sentence IDs corresponding to the \code{docId} and \code{token} arguments #' @param token A character vector of tokens corresponding to the \code{docId} and \code{sentenceId} arguments #' @param docId A character vector of document IDs corresponding to the \code{sentenceId} and \code{token} arguments. Can be \code{NULL} if \code{sentencesAsDocs} is \code{TRUE}. #' @param sentencesAsDocs \code{TRUE} or \code{FALSE}, indicating whether or not to treat sentences as documents when calculating tfidf scores. If \code{TRUE}, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization) #' @return A 3 column dataframe of pairwise distances between sentences. Columns: \code{sent1} (sentence id), \code{sent2} (sentence id), & \code{dist} (distance between \code{sent1} and \code{sent2}). 
#' @references \url{http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html} #' @examples #' sentenceSimil(docId=c("d1","d1","d2","d2"), #' sentenceId=c("d1_1","d1_1","d2_1","d2_1"), #' token=c("i", "ran", "jane", "ran")) #' @importFrom utils combn #' @importFrom stats xtabs #' @export sentenceSimil <- function(sentenceId, token, docId=NULL, sentencesAsDocs=FALSE){ if(!is.logical(sentencesAsDocs)) stop("sentencesAsDocs must be logical") if(length(sentencesAsDocs) != 1) stop("sentencesAsDocs must be length 1") if(!sentencesAsDocs & is.null(docId)) stop("docIds must be provided if sentencesAsDocs is FALSE") sentenceId <- as.character(sentenceId) if(!is.character(token)) stop("token must be character") if(length(token) < 1) stop("token must be at least length 1") if(sentencesAsDocs) { docId <- sentenceId if(length(docId) != length(sentenceId) | length(docId) != length(token)) stop("docId, sentenceId, & token must all be the same length") } else if (!sentencesAsDocs) { docId <- as.character(docId) if(length(sentenceId) != length(token)) stop("sentenceId & token must be the same length") } ndoc <- length(unique(docId)) if(ndoc > length(unique(sentenceId))) warning("There are more unique docIds than sentenceIds. Verify you have passed the correct parameters to the function.") tokenDf <- data.frame(docId=docId, sentenceId=sentenceId, token=token, stringsAsFactors = FALSE) stmList = split(tokenDf, paste0(tokenDf$docId,tokenDf$token)) stmList = lapply(stmList, function(dfi) { dfi[['tf']] = nrow(dfi) unique(dfi) }) stm = do.call('rbind', stmList) stmList = split(stm, stm$token) stmList = lapply(stmList, function(dfi) { dfi[['idf']] = 1+log(ndoc/length(unique(dfi$docId))) dfi[['tfidf']] = dfi$tf*dfi$idf unique(dfi) }) stm = do.call('rbind', stmList) rownames(stm) = NULL stm = stm[order(stm$docId, stm$token), c("docId", "token", "tf", "idf", "tfidf")] if(!sentencesAsDocs) { stm = merge(tokenDf, stm, by=c("docId","token"), all.x=FALSE, all.y=TRUE) stm = unique(stm[stm$tfidf > 0, c("sentenceId", "token", "tfidf")]) } else if (sentencesAsDocs) { stm = unique(stm[stm$tfidf > 0, c("docId", "token", "tfidf")]) names(stm) = c("sentenceId", "token", "tfidf") } stm = stm[order(stm$sentenceId, stm$token),] if(nrow(stm)==0) stop("All values in sentence term tfidf matrix are 0. Similarities would return as NaN") if(length(unique((stm$sentenceId))) == 1) stop("Only one sentence had nonzero tfidf scores. Similarities would return as NaN") stm = xtabs(tfidf ~ sentenceId + token, stm) sentencePairsDf = as.data.frame(t(combn(sort(rownames(stm)), 2)), stringsAsFactors=FALSE) sentencePairsDf[['similVal']] = idfCosineSimil(stm) names(sentencePairsDf) = c("sent1", "sent2", "similVal") return(sentencePairsDf) } lexRankr/R/sentenceParse.R0000644000176200001440000000435113213610121015154 0ustar liggesusers#' Parse text into sentences #' @description Parse the elements of a character vector into a dataframe of sentences with additional identifiers. #' @param text Character vector to be parsed into sentences #' @param docId A vector of document IDs with length equal to the length of \code{text}. If \code{docId == "create"} then doc IDs will be created as an index from 1 to \code{n}, where \code{n} is the length of \code{text}. #' @return A data frame with 3 columns and \code{n} rows, where \code{n} is the number of sentences found by the routine. Column 1: \code{docId} document id for the sentence. Column 2: \code{sentenceId} sentence id for the sentence. 
Column 3: \code{sentence} the sentences found in the routine. #' @examples #' sentenceParse("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA.") #' sentenceParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."), #' docId=c("d1","d2")) #' @export sentenceParse <- function(text, docId = "create") { if(!is.character(text)) stop("text must be character") if(length(text) < 1) stop("text must be at least length 1") docId <- as.character(docId) if(length(docId)==1 & docId[1]=="create") { createDocIds <- TRUE } else if(length(docId)==length(text)) { createDocIds <- FALSE } else if(length(docId)!=length(text)) stop("docId vector must be same length as text vector") sentences <- sentence_parser(text) sentenceDfList <- lapply(seq_along(sentences), function(i) { sentVec <- trimws(sentences[[i]]) if (length(sentVec) == 0) sentVec = "" if(createDocIds) { out = data.frame(docId=i, sentenceId=paste0(i,"_",seq_along(sentVec)), sentence=sentVec, stringsAsFactors = FALSE) } else if(!createDocIds) { out = data.frame(docId=docId[i], sentence=sentVec, stringsAsFactors = FALSE) } out }) sentenceDf <- do.call('rbind', sentenceDfList) sentenceDfList <- split(sentenceDf, sentenceDf$docId) sentenceDfList <- lapply(sentenceDfList, function(dfi) { dfi$sentenceId <- paste0(dfi$docId, "_", 1:nrow(dfi)) dfi[,c("docId","sentenceId","sentence")] }) sentenceDf <- do.call('rbind', sentenceDfList) class(sentenceDf) <- "data.frame" rownames(sentenceDf) <- NULL return(sentenceDf) } lexRankr/R/sentence_parser.R0000644000176200001440000000076213213603250015544 0ustar liggesusers#' Utility to parse sentences from text #' @description Utility to parse sentences from text; created to have a central shared sentence parsing function #' @param text Character vector to be parsed into sentences #' @return A list with length equal to `length(text)`; list elements are character vectors of text parsed with sentence regex sentence_parser <- function(text) { strsplit(x = text, split = "(?% unnest_sentences(sents, text) %>% bind_lexrank(sents, doc_id, level = 'sentences') %>% arrange(desc(lexrank)) ``` ## More Examples * [Vignette](https://CRAN.R-project.org/package=lexRankr/vignettes/Analyzing_Twitter_with_LexRankr.html) * [Summarizing Web Articles with R using lexRankr](https://adamspannbauer.github.io/2017/12/17/summarizing-web-articles-with-r/) * [lexRankr & Twitter: find a user's most representative tweets](https://adamspannbauer.github.io/2017/03/09/lexrankr--twitter-find-a-users-most-representative-tweets/) lexRankr/MD50000644000176200001440000000525713443532523012365 0ustar liggesusersaac546afac249557f617c8866ee7ce24 *DESCRIPTION c61f05b227045ce8c01217b2477f7cd5 *LICENSE ca1fbcd11f4630e69bb15aee883f98ab *NAMESPACE 8b4b708f3db2eb0277a2388787b9daa0 *NEWS.md cd64922fc956e9809d4c46842a922106 *R/RcppExports.R 3c4fafcb0976ff9b4c4540edcc574b3f *R/bind_lexrank.R da8cc74ba789622703870b939a5bef28 *R/data.R 9526c8c9cc77f58be3204bba6d6965a4 *R/lexRank.R 602af5d4ad3271553492bed1a15a3a07 *R/lexRankFromSimil.R 1a1841aa1111265bf6082f11132f88ed *R/sentenceParse.R 4ff5f2ecea693618f4e5aeeae26183bf *R/sentenceSimil.R d233dfa75ea27fceaf439e2c77037a15 *R/sentenceTokenParse.R e53e6847978319a59bed3450903419b1 *R/sentence_parser.R 3b4ae6fb3e4b48d1f59c4e166ba8d363 *R/sysdata.rda c40293d5e143ebec7f0531a160fb3b8f *R/tokenize.R 4f9bfeba05ba23790191e14280506981 *R/unnest_sentences.R 06e3bda580cf9042267921acaabf7855 *README.md 955fdc530bda1d71bfe9b893c66844ae *build/vignette.rds 3b4ae6fb3e4b48d1f59c4e166ba8d363 *data/smart_stopwords.rda 
f34990950f88340832b2e02b568ceea7 *inst/doc/Analyzing_Twitter_with_LexRankr.html cfbad0582a84358f6d4ab3b035941972 *inst/doc/Analyzing_Twitter_with_LexRankr.html.asis 5beb2dcb26a1dd0f0524144f6eb1ae65 *man/bind_lexrank_.Rd e012a599340b7fe9033b6079f7c570d5 *man/lexRank.Rd 9e85e9bfc5f0c65028e5e027d092389c *man/lexRankFromSimil.Rd 0b719b51c1ddc2d5ee42dfe85b898ae2 *man/sentenceParse.Rd 5d3b96eff5f2ef107f790f83af7d6149 *man/sentenceSimil.Rd 09d182b7e8d773f535c7e4b7d4d94e72 *man/sentenceTokenParse.Rd db32bcac0cf42df74a5039bc5bc958d0 *man/sentence_parser.Rd d2758e4e6a15ac071ce896656478a2cb *man/smart_stopwords.Rd 961f6c3d3a908c437931727a28c225d7 *man/tokenize.Rd 48dda41f11d6bb9558917d8d38fbb5a4 *man/unnest_sentences_.Rd 8cdcce9b4a60f8d8bb1802fbcd9f615b *src/RcppExports.cpp 2cd12945f1ba427de73a8b690ffd4615 *src/idfCosineSimil.cpp 9c59a85a10ea7c6b21167c86763f972d *src/register_routines.c 47125b04d3f431060fb4a9ff5ee64747 *tests/testthat.R 4589e471cc494c7af261b8ad39201caa *tests/testthat/test-bind_lexrank.R f3eb1d5d45b8bbca2e93b481ce89b561 *tests/testthat/test-bind_lexrank_.R dc393e89616ecf8af6e7c3a31751d144 *tests/testthat/test-idfCosine.R 6dce41102accc50f0055debf2d4f8e2d *tests/testthat/test-lexRank.R 2586ca6a83c8c4152097c6d02438be8d *tests/testthat/test-lexRankFromSimil.R 3dc6efa951a72a397a72507f12b2906f *tests/testthat/test-sentenceParse.R 7514a11b5b5ab14239a0f9163330c450 *tests/testthat/test-sentenceSimil.R 7bca719093147a66ad4acac2d8652cc1 *tests/testthat/test-sentenceTokenParse.R b9e9f433828bbcca0f22d5456940178c *tests/testthat/test-tokenize.R efb17c08f184ff1e886b3ba46eb851a7 *tests/testthat/test-unnest_sentences.R 2d4fb1ace9b7af67d010071d9d859294 *tests/testthat/test-unnest_sentences_.R cfbad0582a84358f6d4ab3b035941972 *vignettes/Analyzing_Twitter_with_LexRankr.html.asis lexRankr/build/0000755000176200001440000000000013443530263013142 5ustar liggesuserslexRankr/build/vignette.rds0000644000176200001440000000036013443530263015500 0ustar liggesusersP 0NӪ >\\ VFLV'\F\㽥X1Δk15& %tMَhb8ȯv̄Zw/&VT&nŞwJV;D;fX׀JےE lexRankr/DESCRIPTION0000644000176200001440000000143513443532523013555 0ustar liggesusersPackage: lexRankr Type: Package Title: Extractive Summarization of Text with the LexRank Algorithm Version: 0.5.2 Author: Adam Spannbauer [aut, cre], Bryan White [ctb] Maintainer: Adam Spannbauer Description: An R implementation of the LexRank algorithm described by G. Erkan and D. R. Radev (2004) . License: MIT + file LICENSE URL: https://github.com/AdamSpannbauer/lexRankr/ BugReports: https://github.com/AdamSpannbauer/lexRankr/issues/ LazyData: TRUE RoxygenNote: 6.1.1 Imports: SnowballC, igraph, Rcpp Depends: R (>= 2.10) LinkingTo: Rcpp Suggests: covr, testthat, R.rsp VignetteBuilder: R.rsp Encoding: UTF-8 NeedsCompilation: yes Packaged: 2019-03-17 20:40:20 UTC; adamspannbauer Repository: CRAN Date/Publication: 2019-03-17 21:00:03 UTC lexRankr/man/0000755000176200001440000000000013213604623012613 5ustar liggesuserslexRankr/man/bind_lexrank_.Rd0000644000176200001440000000673013213611447015711 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/bind_lexrank.R \name{bind_lexrank_} \alias{bind_lexrank_} \alias{bind_lexrank} \title{Bind lexrank scores to a dataframe of text} \usage{ bind_lexrank_(tbl, text, doc_id, sent_id = NULL, level = c("sentences", "tokens"), threshold = 0.2, usePageRank = TRUE, damping = 0.85, continuous = FALSE, ...) 
bind_lexrank(tbl, text, doc_id, sent_id = NULL, level = c("sentences",
  "tokens"), threshold = 0.2, usePageRank = TRUE, damping = 0.85,
  continuous = FALSE, ...)
}
\arguments{
\item{tbl}{dataframe containing column of sentences to be lexranked}

\item{text}{name of column containing sentences or tokens to be lexranked}

\item{doc_id}{name of column containing document ids corresponding to \code{text}}

\item{sent_id}{Only needed if \code{level} is "tokens". Name of column containing sentence ids corresponding to \code{text}}

\item{level}{the parsed level of the text column to be lexranked, i.e. is \code{text} a column of "sentences" or "tokens"? The "tokens" level is provided to allow users to implement custom tokenization. Note: even if the input \code{level} is "tokens", lexrank scores are assigned at the sentence level.}

\item{threshold}{The minimum similarity value a sentence pair must have to be represented in the graph where lexRank is calculated.}

\item{usePageRank}{\code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentence's unweighted centrality will be used as the rank. Defaults to \code{TRUE}.}

\item{damping}{The damping factor to be passed to the page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}.}

\item{continuous}{\code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to \code{FALSE}.}

\item{...}{tokenizing options to be passed to lexRankr::tokenize. Ignored if \code{level} is "sentences"}
}
\value{
A dataframe with an additional column of lexrank scores (column is given name lexrank)
}
\description{
Bind lexrank scores to a dataframe of sentences or to a dataframe of tokens with sentence ids
}
\examples{
df <- data.frame(doc_id = 1:3,
                 text = c("Testing the system.
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors = FALSE) \dontrun{ library(magrittr) df \%>\% unnest_sentences(sents, text) \%>\% bind_lexrank(sents, doc_id, level = "sentences") df \%>\% unnest_sentences(sents, text) \%>\% bind_lexrank_("sents", "doc_id", level = "sentences") df <- data.frame(doc_id = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3), sent_id = c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), tokens = c("testing", "the", "system", "second", "sentence", "for", "you", "system", "testing", "the", "tidy", "documents", "df", "documents", "will", "be", "parsed", "and", "lexranked"), stringsAsFactors = FALSE) df \%>\% bind_lexrank(tokens, doc_id, sent_id, level = 'tokens') } } lexRankr/man/sentence_parser.Rd0000644000176200001440000000102213177136432016264 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/sentence_parser.R \name{sentence_parser} \alias{sentence_parser} \title{Utility to parse sentences from text} \usage{ sentence_parser(text) } \arguments{ \item{text}{Character vector to be parsed into sentences} } \value{ A list with length equal to `length(text)`; list elements are character vectors of text parsed with sentence regex } \description{ Utility to parse sentences from text; created to have a central shared sentence parsing function } lexRankr/man/sentenceParse.Rd0000644000176200001440000000215713177136432015715 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/sentenceParse.R \name{sentenceParse} \alias{sentenceParse} \title{Parse text into sentences} \usage{ sentenceParse(text, docId = "create") } \arguments{ \item{text}{Character vector to be parsed into sentences} \item{docId}{A vector of document IDs with length equal to the length of \code{text}. If \code{docId == "create"} then doc IDs will be created as an index from 1 to \code{n}, where \code{n} is the length of \code{text}.} } \value{ A data frame with 3 columns and \code{n} rows, where \code{n} is the number of sentences found by the routine. Column 1: \code{docId} document id for the sentence. Column 2: \code{sentenceId} sentence id for the sentence. Column 3: \code{sentence} the sentences found in the routine. } \description{ Parse the elements of a character vector into a dataframe of sentences with additional identifiers. } \examples{ sentenceParse("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA.") sentenceParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."), docId=c("d1","d2")) } lexRankr/man/sentenceSimil.Rd0000644000176200001440000000321013401570660015702 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/sentenceSimil.R \name{sentenceSimil} \alias{sentenceSimil} \title{Compute distance between sentences} \usage{ sentenceSimil(sentenceId, token, docId = NULL, sentencesAsDocs = FALSE) } \arguments{ \item{sentenceId}{A character vector of sentence IDs corresponding to the \code{docId} and \code{token} arguments} \item{token}{A character vector of tokens corresponding to the \code{docId} and \code{sentenceId} arguments} \item{docId}{A character vector of document IDs corresponding to the \code{sentenceId} and \code{token} arguments. Can be \code{NULL} if \code{sentencesAsDocs} is \code{TRUE}.} \item{sentencesAsDocs}{\code{TRUE} or \code{FALSE}, indicating whether or not to treat sentences as documents when calculating tfidf scores. 
If \code{TRUE}, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization)} } \value{ A 3 column dataframe of pairwise distances between sentences. Columns: \code{sent1} (sentence id), \code{sent2} (sentence id), & \code{dist} (distance between \code{sent1} and \code{sent2}). } \description{ Compute distance between sentences using modified idf cosine distance from "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization". Output can be used as input to \code{\link{lexRankFromSimil}}. } \examples{ sentenceSimil(docId=c("d1","d1","d2","d2"), sentenceId=c("d1_1","d1_1","d2_1","d2_1"), token=c("i", "ran", "jane", "ran")) } \references{ \url{http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html} } lexRankr/man/smart_stopwords.Rd0000644000176200001440000000077313213604707016366 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{smart_stopwords} \alias{smart_stopwords} \title{SMART English Stopwords} \format{a character vector with 571 elements} \source{ \url{http://jmlr.csail.mit.edu/papers/volume5/lewis04a/} } \usage{ smart_stopwords } \description{ English stopwords from the SMART information retrieval system (as documented in Appendix 11 of \url{http://jmlr.csail.mit.edu/papers/volume5/lewis04a/}) } \keyword{datasets} lexRankr/man/unnest_sentences_.Rd0000644000176200001440000000265313443525405016637 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/unnest_sentences.R \name{unnest_sentences_} \alias{unnest_sentences_} \alias{unnest_sentences} \title{Split a column of text into sentences} \usage{ unnest_sentences_(tbl, output, input, doc_id = NULL, output_id = "sent_id", drop = TRUE) unnest_sentences(tbl, output, input, doc_id = NULL, output_id = "sent_id", drop = TRUE) } \arguments{ \item{tbl}{dataframe containing column of text to be split into sentences} \item{output}{name of column to be created to store parsed sentences} \item{input}{name of input column of text to be parsed into sentences} \item{doc_id}{column of document ids; if not provided it will be assumed that each row is a different document} \item{output_id}{name of column to be created to store sentence ids} \item{drop}{whether original input column should get dropped} } \value{ A data.frame of parsed sentences and sentence ids } \description{ Split a column of text into sentences } \examples{ df <- data.frame(doc_id = 1:3, text = c("Testing the system. 
Second sentence for you.", "System testing the tidy documents df.", "Documents will be parsed and lexranked."), stringsAsFactors=FALSE) unnest_sentences(df, sents, text) unnest_sentences_(df, "sents", "text") \dontrun{ library(magrittr) df \%>\% unnest_sentences(sents, text) } } lexRankr/man/lexRankFromSimil.Rd0000644000176200001440000000544713443525405016347 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/lexRankFromSimil.R \name{lexRankFromSimil} \alias{lexRankFromSimil} \title{Compute LexRanks from pairwise sentence similarities} \usage{ lexRankFromSimil(s1, s2, simil, threshold = 0.2, n = 3, returnTies = TRUE, usePageRank = TRUE, damping = 0.85, continuous = FALSE) } \arguments{ \item{s1}{A character vector of sentence IDs corresponding to the \code{s2} and \code{simil} arguments} \item{s2}{A character vector of sentence IDs corresponding to the \code{s1} and \code{simil} arguments} \item{simil}{A numeric vector of similarity values that represents the similarity between the sentences represented by the IDs in \code{s1} and \code{s2}.} \item{threshold}{The minimum simil value a sentence pair must have to be represented in the graph where lexRank is calculated.} \item{n}{The number of sentences to return as the extractive summary. The function will return the top \code{n} lexRanked sentences. See \code{returnTies} for handling ties in lexRank.} \item{returnTies}{\code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentence with a top 3 score. If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}.} \item{usePageRank}{\code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentences unweighted centrality will be used as the rank. Defaults to \code{TRUE}.} \item{damping}{The damping factor to be passed to page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}.} \item{continuous}{\code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to \code{FALSE}.} } \value{ A 2 column dataframe with columns \code{sentenceId} and \code{value}. \code{sentenceId} contains the ids of the top \code{n} sentences in descending order by \code{value}. \code{value} contains page rank score (if \code{usePageRank==TRUE}) or degree centrality (if \code{usePageRank==FALSE}). } \description{ Compute LexRanks from sentence pair similarities using the page rank algorithm or degree centrality the methods used to compute lexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization." 
} \examples{ lexRankFromSimil(s1=c("d1_1","d1_1","d1_2"), s2=c("d1_2","d2_1","d2_1"), simil=c(.01,.03,.5)) } \references{ \url{http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html} } lexRankr/man/sentenceTokenParse.Rd0000644000176200001440000000475213401570747016722 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/sentenceTokenParse.R \name{sentenceTokenParse} \alias{sentenceTokenParse} \title{Parse text into sentences and tokens} \usage{ sentenceTokenParse(text, docId = "create", removePunc = TRUE, removeNum = TRUE, toLower = TRUE, stemWords = TRUE, rmStopWords = TRUE) } \arguments{ \item{text}{A character vector of documents to be parsed into sentences and tokenized.} \item{docId}{A character vector of document Ids the same length as \code{text}. If \code{docId=="create"} document Ids will be created.} \item{removePunc}{\code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text} while tokenizing. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}.} \item{removeNum}{\code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from \code{text} while tokenizing. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.} \item{toLower}{\code{TRUE} or \code{FALSE} indicating whether or not to coerce all of \code{text} to lowercase while tokenizing. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.} \item{stemWords}{\code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the outputted tokens will be tokenized using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.} \item{rmStopWords}{\code{TRUE}, \code{FALSE}, or character vector of stopwords to remove from tokens. If \code{TRUE}, words in \code{lexRankr::smart_stopwords} will be removed prior to stemming. If \code{FALSE}, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to \code{TRUE}.} } \value{ A list of dataframes. The first element of the list returned is the \code{sentences} dataframe; this dataframe has columns \code{docId}, \code{sentenceId}, & \code{sentence} (the actual text of the sentence). The second element of the list returned is the \code{tokens} dataframe; this dataframe has columns \code{docId}, \code{sentenceId}, & \code{token} (the actual text of the token). } \description{ Parse a character vector of documents into into both sentences and a clean vector of tokens. The resulting output includes IDs for document and sentence for use in other \code{lexRank} functions. } \examples{ sentenceTokenParse(c("Bill is trying to earn a Ph.D.", "You have to have a 5.0 GPA."), docId=c("d1","d2")) } lexRankr/man/tokenize.Rd0000644000176200001440000000347513401570747014753 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/tokenize.R \name{tokenize} \alias{tokenize} \title{Tokenize a character vector Parse the elements of a character vector into a list of cleaned tokens.} \usage{ tokenize(text, removePunc = TRUE, removeNum = TRUE, toLower = TRUE, stemWords = TRUE, rmStopWords = TRUE) } \arguments{ \item{text}{The character vector to be tokenized} \item{removePunc}{\code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from \code{text}. If \code{TRUE}, punctuation will be removed. 
Defaults to \code{TRUE}.} \item{removeNum}{\code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from \code{text}. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.} \item{toLower}{\code{TRUE} or \code{FALSE} indicating whether or not to coerce all of \code{text} to lowercase. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.} \item{stemWords}{\code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the outputted tokens will be tokenized using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.} \item{rmStopWords}{\code{TRUE}, \code{FALSE}, or character vector of stopwords to remove. If \code{TRUE}, words in \code{lexRankr::smart_stopwords} will be removed prior to stemming. If \code{FALSE}, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to \code{TRUE}.} } \description{ Tokenize a character vector Parse the elements of a character vector into a list of cleaned tokens. } \examples{ tokenize("Mr. Feeny said the test would be on Sat. At least I'm 99.9\% sure that's what he said.") tokenize("Bill is trying to earn a Ph.D. in his field.", rmStopWords=FALSE) } lexRankr/man/lexRank.Rd0000644000176200001440000001101513401570660014506 0ustar liggesusers% Generated by roxygen2: do not edit by hand % Please edit documentation in R/lexRank.R \name{lexRank} \alias{lexRank} \title{Extractive text summarization with LexRank} \usage{ lexRank(text, docId = "create", threshold = 0.2, n = 3, returnTies = TRUE, usePageRank = TRUE, damping = 0.85, continuous = FALSE, sentencesAsDocs = FALSE, removePunc = TRUE, removeNum = TRUE, toLower = TRUE, stemWords = TRUE, rmStopWords = TRUE, Verbose = TRUE) } \arguments{ \item{text}{A character vector of documents to be cleaned and processed by the LexRank algorithm} \item{docId}{A vector of document IDs with length equal to the length of \code{text}. If \code{docId == "create"} then doc IDs will be created as an index from 1 to \code{n}, where \code{n} is the length of \code{text}.} \item{threshold}{The minimum simil value a sentence pair must have to be represented in the graph where lexRank is calculated.} \item{n}{The number of sentences to return as the extractive summary. The function will return the top \code{n} lexRanked sentences. See \code{returnTies} for handling ties in lexRank.} \item{returnTies}{\code{TRUE} or \code{FALSE} indicating whether or not to return greater than \code{n} sentence IDs if there is a tie in lexRank. If \code{TRUE}, the returned number of sentences will not be limited to \code{n}, but rather will return every sentence with a top 3 score. If \code{FALSE}, the returned number of sentences will be \code{<=n}. Defaults to \code{TRUE}.} \item{usePageRank}{\code{TRUE} or \code{FALSE} indicating whether or not to use the page rank algorithm for ranking sentences. If \code{FALSE}, a sentences unweighted centrality will be used as the rank. Defaults to \code{TRUE}.} \item{damping}{The damping factor to be passed to page rank algorithm. Ignored if \code{usePageRank} is \code{FALSE}.} \item{continuous}{\code{TRUE} or \code{FALSE} indicating whether or not to use continuous LexRank. Only applies if \code{usePageRank==TRUE}. If \code{TRUE}, \code{threshold} will be ignored and lexRank will be computed using a weighted graph representation of the sentences. 
Defaults to \code{FALSE}.}

\item{sentencesAsDocs}{\code{TRUE} or \code{FALSE}, indicating whether or not to treat sentences as documents when calculating tfidf scores for similarity. If \code{TRUE}, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization).}

\item{removePunc}{\code{TRUE} or \code{FALSE} indicating whether or not to remove punctuation from text while tokenizing. If \code{TRUE}, punctuation will be removed. Defaults to \code{TRUE}.}

\item{removeNum}{\code{TRUE} or \code{FALSE} indicating whether or not to remove numbers from text while tokenizing. If \code{TRUE}, numbers will be removed. Defaults to \code{TRUE}.}

\item{toLower}{\code{TRUE} or \code{FALSE} indicating whether or not to coerce all of text to lowercase while tokenizing. If \code{TRUE}, \code{text} will be coerced to lowercase. Defaults to \code{TRUE}.}

\item{stemWords}{\code{TRUE} or \code{FALSE} indicating whether or not to stem resulting tokens. If \code{TRUE}, the resulting tokens will be stemmed using \code{SnowballC::wordStem()}. Defaults to \code{TRUE}.}

\item{rmStopWords}{\code{TRUE}, \code{FALSE}, or character vector of stopwords to remove from tokens. If \code{TRUE}, words in \code{lexRankr::smart_stopwords} will be removed prior to stemming. If \code{FALSE}, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to \code{TRUE}.}

\item{Verbose}{\code{TRUE} or \code{FALSE} indicating whether or not to \code{cat} progress messages to the console while running. Defaults to \code{TRUE}.}
}
\value{
A 2 column dataframe with columns \code{sentenceId} and \code{value}. \code{sentenceId} contains the ids of the top \code{n} sentences in descending order by \code{value}. \code{value} contains the page rank score (if \code{usePageRank==TRUE}) or degree centrality (if \code{usePageRank==FALSE}).
}
\description{
Compute LexRanks from a vector of documents using the page rank algorithm or degree centrality. The methods used to compute lexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."
}
\examples{
lexRank(c("This is a test.","Tests are fun.",
          "Do you think the exam will be hard?","Is an exam the same as a test?",
          "How many questions are going to be on the exam?"))
}
\references{
\url{http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html}
}
lexRankr/LICENSE0000644000176200001440000000005513177136432013054 0ustar liggesusersYEAR: 2016
COPYRIGHT HOLDER: Adam Spannbauer