
What does remove_punctuation_vector do exactly? #8

Closed
hanson1005 opened this issue Feb 6, 2019 · 12 comments

@hanson1005

Hi. I am following the code from your vignette, but using my own data:
https://cran.r-project.org/web/packages/textTinyR/vignettes/word_vectors_doc2vec.html

When I run the following commands with remove_punctuation_vector changed from FALSE (as in your vignette) to TRUE, the generated "output_token_single_file.txt" shrinks significantly, from 982,000 KB to 258,000 KB:

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, to_lower = T, remove_punctuation_vector = T, .....)

utl = textTinyR::sparse_term_matrix$new(vector_data = concat, file_data = NULL, document_term_matrix = TRUE)
tm = utl$Term_Matrix(sort_terms = FALSE, to_lower = T, remove_punctuation_vector = T, ....)

Also, when building the word vectors (word2vec) with the following code, the progress output shows that it reads 145M words when the earlier step used "remove_punctuation_vector = F", but only 39M words when it used "remove_punctuation_vector = T":

vecs= fastTextR::skipgram_cbow(input_path = PATH_INPUT, output_path = PATH_OUT, ... )

So, my question is: doesn't remove_punctuation_vector simply remove punctuation from the text? Why does the number of words in the corpus drop to less than a third with this change alone?
What does this option actually remove?

@mlampros
Owner

mlampros commented Feb 6, 2019

hi @hanson1005,

in the documentation of the textTinyR package, and especially for the tokenize_transform_vec_docs function (though not only for that one), I added the following:

'remove_punctuation_string'  either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

'remove_punctuation_vector'  either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

The difference between the two (remove_punctuation_string and remove_punctuation_vector) has to do with the 'split_string' parameter,

'split_string'  either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

but also with the 'split_separator' and 'as_token' parameters.

All these parameters interact depending on the use case. You might want to first split a character string (sentence) and then remove the punctuation; in another case you might want to remove the punctuation without splitting the string at all.
You can experiment with these parameters using the example I've added to the documentation of tokenize_transform_vec_docs (be aware that if you include, for instance, the period ('.') in the 'split_separator', then the string will be split at that character and the character itself will not appear in the output):

library(textTinyR)

token_doc_vec = c("CONVERT to low....er", "remove.. punctuation11234", "trim token and split ")

res = tokenize_transform_vec_docs(object = token_doc_vec, 
                                  as_token = TRUE,
                                  to_lower = TRUE, 
                                  split_string = TRUE,
                                  split_separator = " \r\n\t,;:()?!//", 
                                  remove_punctuation_string = TRUE,          # FALSE
                                  remove_punctuation_vector = FALSE)        # TRUE

res
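
For a side-by-side comparison, here is the same call with the two punctuation flags flipped (this corresponds to the commented-out values above; no other parameters change):

res2 = tokenize_transform_vec_docs(object = token_doc_vec, 
                                   as_token = TRUE,
                                   to_lower = TRUE, 
                                   split_string = TRUE,
                                   split_separator = " \r\n\t,;:()?!//", 
                                   remove_punctuation_string = FALSE,        # TRUE in the call above
                                   remove_punctuation_vector = TRUE)         # FALSE in the call above

res2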

Regarding your second point about the reduction of the output size: it comes down to the number of special characters in the input file, because the remove_punctuation_vector parameter removes all special characters from the input data.
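
If you want to sanity-check this on your own input, a rough way (a sketch, assuming 'concat' is the character vector you pass to the function) is to count the punctuation characters it contains:

# keep only the punctuation characters of each document, then count them
punct_only = gsub("[^[:punct:]]", "", concat)
sum(nchar(punct_only))      # total punctuation characters that remove_punctuation_vector would strip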

@mlampros
Owner

mlampros commented Feb 7, 2019

hi @hanson1005 ,

textTinyR (compared to my other R packages) has many parameters, so I took a second look at this issue to find out whether there is a bug. I ran the textTinyR::tokenize_transform_vec_docs() function with the remove_punctuation_vector parameter set first to FALSE and then to TRUE, and I also saved the vocabulary to a file so I could load it back into R and inspect the character strings along with their counts:

save_dat = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, 
                                                  to_lower = T, 
                                                  remove_punctuation_vector = FALSE,
                                                  remove_numbers = F, trim_token = T, 
                                                  split_string = T, 
                                                  split_separator = " \r\n\t.,;:()?!//",
                                                  remove_stopwords = T, language = "english", 
                                                  min_num_char = 3, max_num_char = 100, 
                                                  stemmer = "porter2_stemmer", 
                                                  path_2folder = "/path_to_your_folder/",
                                                  vocabulary_path_file = "/path_to_your_folder/vocab.txt",
                                                  threads = 6, verbose = T)

punct1 = unlist(save_dat$token)

save_dat2 = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, 
                                                  to_lower = T, 
                                                  remove_punctuation_vector = TRUE,
                                                  remove_numbers = F, trim_token = T, 
                                                  split_string = T, 
                                                  split_separator = " \r\n\t.,;:()?!//",
                                                  remove_stopwords = T, language = "english", 
                                                  min_num_char = 3, max_num_char = 100, 
                                                  stemmer = "porter2_stemmer", 
                                                  path_2folder = "/path_to_your_folder2/",
                                                  vocabulary_path_file = "/path_to_your_folder2/vocab.txt",
                                                  threads = 6, verbose = T)


punct2 = unlist(save_dat2$token)

If you load the two "output_token_single_file.txt" files from the two folders, you can count the number of characters in each one using the 'readLines' and 'nchar' base functions:

fileName <- '/path_to_your_folder/output_token_single_file.txt'
con <- file(fileName,open="r")
line1 <- readLines(con)
close(con)

str(line1)
nchar(line1)

[1] 4964786


fileName <- '/path_to_your_folder2/output_token_single_file.txt'
con <- file(fileName,open="r")
line2 <- readLines(con)
close(con)

str(line2)
nchar(line2)

[1] 4878736

Therefore, with "remove_punctuation_vector" set to FALSE I get 4964786 characters, whereas with it set to TRUE I get 4878736 characters. That is a difference of roughly 2 % ( (1.0 - (4878736 / 4964786)) * 100 ). Also, the first file has a size of 5.0 MB whereas the file from the second folder is 4.9 MB, which seems correct given that the second file does not include special characters.
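
To confirm the sizes programmatically (a quick check with base R, using the same placeholder paths as above):

file.size('/path_to_your_folder/output_token_single_file.txt')  / 1000^2    # size in MB, punctuation kept
file.size('/path_to_your_folder2/output_token_single_file.txt') / 1000^2    # size in MB, punctuation removed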

You should also know that, under the CRAN submission policy, R package authors are not allowed to modify a user's workspace. Previous "output_token_single_file.txt" files are therefore not removed automatically: each time you run one of the functions that saves an "output_token_single_file.txt" to your target folder, you have to delete the previous file first, otherwise the new data is appended to the end of the existing file, inflating its size. Can you please re-run the previous code snippets after first removing any previously saved files?
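
For example, a small sketch of that clean-up step before re-running (adjust the placeholder path):

out_file = '/path_to_your_folder/output_token_single_file.txt'
if (file.exists(out_file)) file.remove(out_file)    # delete the previous output so the new run does not append to it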

Moreover, on which operating system do you use the textTinyR package?

@hanson1005
Author

hanson1005 commented Feb 8, 2019 via email

@mlampros
Owner

mlampros commented Feb 8, 2019

hi @hanson1005,

thanks for making me aware of this issue. It is a bug on the Windows OS caused by the OpenMP parallelization. The issue does not appear on Linux (I haven't tested on Macintosh yet).
I'll upload the updated version to the GitHub repository tomorrow. To receive the same results across multiple runs you must set the 'threads' parameter to 1. For testing purposes you can use the following code snippet, based on the test_text.txt data from the tests folder (you can also adapt it to your own data set). There are two cases to toggle each time you run the for-loop: first, with or without parallelization; second, with or without removal of punctuation. Moreover, on Windows the 'unlink' base R function does not seem to work, so before each run you must delete the previously created folders inside the 'folder_for_all_out' folder. The for-loop counts the KBs and characters of the output files based on the uncommented variables ('THREADS_' and 'remove_punct'):

PATH = 'test_text.txt'

text_rows = textTinyR::read_rows(input_file = PATH)$data

iters = parallel::detectCores()

folder_for_all_out = 'save_folder'          # specify a folder to save the output-files

K_BYTES = CHARS = rep(NA, iters)
lst_vocab_terms_values = list()


for (i in 1:iters) {
  
  #-------------- 1st. case
  THREADS_ = 1
  # THREADS_ = i
  #--------------
  
  #------------------- 2nd. case
  remove_punct = TRUE
  # remove_punct = FALSE
  #-------------------  

  if (.Platform$OS.type == "windows") {
    ext = "\\"
  }
  if (.Platform$OS.type == "unix") {
    ext = "/"
  }

  tmp_out_dir = paste0(file.path(folder_for_all_out, i), ext)
  vocab_file = file.path(folder_for_all_out, i, 'VOCAB.txt')
  
  #---------------------------------------------- on linux
  if (.Platform$OS.type == "unix") {
     if (dir.exists(tmp_out_dir)) {
       unlink(tmp_out_dir, recursive = T)
       dir.create(tmp_out_dir)
     }
     else {
       dir.create(tmp_out_dir)
     }
  }
  #---------------------------------------------- on windows [ delete the folders inside 'folder_for_all_out' each time   you run the loop because 'unlink' does not work on windows ]
  
  if (.Platform$OS.type == "windows") {
    dir.create(tmp_out_dir)
  }
  #---------------------------------------------- 
  
  save_dat = textTinyR::tokenize_transform_vec_docs(object = text_rows, as_token = T, 
                                                    to_lower = T, 
                                                    remove_punctuation_vector = remove_punct,
                                                    remove_numbers = F, trim_token = T, 
                                                    split_string = T, 
                                                    split_separator = " \r\n\t.,;:()?!//",
                                                    remove_stopwords = T, language = "english", 
                                                    min_num_char = 3, max_num_char = 100, 
                                                    stemmer = "porter2_stemmer", 
                                                    path_2folder = tmp_out_dir,
                                                    vocabulary_path_file = vocab_file,
                                                    threads = THREADS_, verbose = T)
  
  tmp_out_file = file.path(folder_for_all_out, i, 'output_token_single_file.txt')
  
  tmp_file = textTinyR::read_characters(input_file = tmp_out_file, characters = 100000)
  tmp_vocab = read.delim(vocab_file, header = F, stringsAsFactors = F)
  # tmp_vocab = tmp_vocab[order(tmp_vocab$V2, decreasing = T), ]
  lst_vocab_terms_values[[i]] = paste(tmp_vocab$V1, tmp_vocab$V2, sep = '_')
  
  K_BYTES[i] = textTinyR::bytes_converter(input_path_file = tmp_out_file, unit = 'KB')
  CHARS[i] = nchar(tmp_file$data, type = "chars")
}


#---------------------------------------
# check that KB's and number of characters are equal
#---------------------------------------

print(K_BYTES)
print(CHARS)

all(K_BYTES == K_BYTES[1])
all(CHARS == CHARS[1])


#----------------------------------
# check vocabulary terms and values
#----------------------------------

all_vocab = rep(NA, length(lst_vocab_terms_values) - 1)

for (j in 2:length(lst_vocab_terms_values)) {
  tmp_all = lst_vocab_terms_values[[1]] %in% lst_vocab_terms_values[[j]]
  # print(tmp_all)
  all_vocab[j-1] = all(tmp_all)
}

print(all_vocab)

I tested it on Ubuntu 18.10 and on Windows 10. On Ubuntu it gives the expected results (although the output text is not in the same order) both with and without parallelization, whereas on Windows 10 it does not.

@hanson1005
Author

hanson1005 commented Feb 8, 2019 via email

@mlampros
Owner

mlampros commented Feb 9, 2019

hi @hanson1005,

regarding the third point you've mentioned: depending on the data set, a single "word" may consist entirely of special characters (punctuation). Once you remove the punctuation from the data, these single words (made up of one or more special characters) are removed as well. I don't have access to your data to say whether that is the case, but you can specify a vocabulary path for the 'tokenize_transform_vec_docs()' function and check for yourself.
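
For example, a rough check (a sketch, assuming you saved a vocabulary via the 'vocabulary_path_file' parameter as in the earlier snippets):

# load the vocabulary (term, count) and list the terms that consist only of punctuation;
# these are the "words" that disappear entirely when remove_punctuation_vector = TRUE
vocab = read.delim('/path_to_your_folder/vocab.txt', header = FALSE, stringsAsFactors = FALSE)
vocab[grepl("^[[:punct:]]+$", vocab$V1), ]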

It would also help if you could add a reproducible example based on a small subset of your data (or fake data) that produces this behaviour, i.e. a case where you expect the number of words generated with and without removing the punctuation to be the same (something along the lines of the skeleton below).
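
A hypothetical skeleton with fake data (swap in a small subset of your own corpus and your stopword list; the parameters mirror the earlier calls):

fake_docs = c("doesn't this look like a well-known low-level example -- with $27 and '' included",
              "aren't hyphens such as life--forget and gulf--who the main issue in this corpus")

count_unique = function(remove_punct) {
  res = textTinyR::tokenize_transform_vec_docs(object = fake_docs, as_token = TRUE,
                                               to_lower = TRUE,
                                               remove_punctuation_vector = remove_punct,
                                               remove_numbers = FALSE, trim_token = TRUE,
                                               split_string = TRUE,
                                               split_separator = " \r\n\t.,;:()?!//",
                                               remove_stopwords = TRUE, language = "english",
                                               threads = 1, verbose = FALSE)
  length(unique(unlist(res$token)))              # number of unique tokens for this setting
}

c(with_punct = count_unique(FALSE), without_punct = count_unique(TRUE))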

I'll update this thread once I upload the new version of the textTinyR package.

@mlampros
Owner

mlampros commented Feb 9, 2019

@hanson1005,

I updated the textTinyR package. You can read about the changes in the NEWS.md file (version 1.1.3). Please test it and let me know (you can install it using devtools::install_github(repo = 'mlampros/textTinyR') ). Thanks again for spotting these bugs.

@stale

stale bot commented Feb 21, 2019

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

@stale stale bot added the stale label Feb 21, 2019
@stale stale bot closed this as completed Feb 28, 2019
@mlampros
Owner

I'll keep this issue open till I submit the new version of the textTinyR package to CRAN.

@mlampros mlampros reopened this Feb 28, 2019
@stale stale bot removed the stale label Feb 28, 2019
@hanson1005
Author

I have further findings regarding the first and second issues I raised earlier in this thread, about the number of characters and the number of unique words with and without removing the punctuation.

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = F,
                                                   remove_numbers = F, trim_token = T,
                                                   split_string = T,
                                                   split_separator = " \r\n\t.,;:()?!//",
                                                   remove_stopwords = stopwordslist,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 1, verbose = T)

unq_w_punc = unique(unlist(clust_vec$token, recursive = F))
length(unq_w_punc)    # 314,967 unique words (its corresponding output_token_single_file is 259,210 KB in size)

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = T,
                                                   remove_numbers = F, trim_token = T,
                                                   split_string = T,
                                                   split_separator = " \r\n\t.,;:()?!//",
                                                   remove_stopwords = stopwordslist,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 1, verbose = T)

unq_wo_punc = unique(unlist(clust_vec$token, recursive = F))
length(unq_wo_punc)   # 274,845 unique words (its corresponding output_token_single_file is 324,376 KB in size)

dropped <- unq_w_punc[!(unq_w_punc %in% unq_wo_punc)]
transformed <- unq_wo_punc[!(unq_wo_punc %in% unq_w_punc)]

dropped[1:30]         # unique words that are dropped after removing the punctuation
 [1] "low-level"       "now-discredit"   "$27"             "evid"            "government-issu" "''"              "catch-22"        "no-risk"
 [9] "win-win"         "a"               "was--how"        "--not"           "life--forget"    "gulf--who"       "well-known"      "veter-an"
[17] "all-encompass"   "short-sight"     "ros-lehtinen"    "that--and"       "o'clock"         "as--excus"       "stand--i"        "explore'"
[25] "is--and"         "treatment"       "``experi"        "`multipl"        "'"               "here--thi"

transformed[1:30]     # unique words that newly emerge (are transformed) after removing the punctuation
 [1] "lowlevel"         "nowdiscredit"    "doesnt"          "governmentissu"  "catch22"         "norisk"          "winwin"
 [8] "washow"           "lifeforget"      "gulfwho"         "wellknown"       "allencompass"    "roslehtinen"     "oclock"
[15] "asexcus"          "standi"          "wouldnt"         "werent"          "wasnt"           "didnt"           "herethi"
[22] "synergisticpeopl" "thingcan"        "sensitivityand"  "drugwhi"         "arent"           "youyou"          "testimonytel"
[29] "wrongwith"        "weve"

What I learn from this is that the reason the clust_vec with punctuation removed has fewer unique words yet more characters than the clust_vec that keeps the punctuation is as follows:

  1. The former removed the words that consist purely of special characters (e.g. "$27", "``a", "'"), so it has fewer unique words.
  2. However, since it transformed "doesn't" into "doesnt" and "aren't" into "arent", a tremendous number of stopwords that should have been removed were not removed. So the problem is that the single quote in these well-known stopwords should NOT be stripped BEFORE the stopwords are removed.
  3. Also, my corpus has many instances of "word--word", "word---word", etc. These hyphens between words would be better substituted with a space. Transforming "well-known" to "wellknown" or "low-level" to "lowlevel" is fine, but turning "life--forget" into "lifeforget" is not; it only generates many infrequent words, which worsens learning on the corpus.

Would you be able to fix these?
For now, before you fix them, I can work around the second point by adding extra stopwords to be removed (e.g. "doesnt") and the third point by doing some clean-up of my corpus, as sketched below. However, I think these issues should really be fixed in the package.
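
For reference, a sketch of those interim workarounds (the extra stopwords are hypothetical examples taken from the 'transformed' list above, and the gsub pre-clean replaces runs of two or more hyphens with a space):

stopwordslist_ext = c(stopwordslist, "doesnt", "arent", "wasnt", "werent", "didnt", "wouldnt")  # apostrophe-stripped stopword forms
concat_clean = gsub("-{2,}", " ", concat)                                                       # "life--forget" -> "life forget"; "well-known" stays intact

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat_clean, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = T,
                                                   remove_numbers = F, trim_token = T,
                                                   split_string = T,
                                                   split_separator = " \r\n\t.,;:()?!//",
                                                   remove_stopwords = stopwordslist_ext,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 1, verbose = T)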

@mlampros
Owner

mlampros commented Mar 1, 2019

Hi @hanson1005 and thanks for your detailed comments / suggestions.

When I wrote the package two years ago I used one or two existing packages to compare the output and the runtime (I tried to reproduce as many cases as possible). One of those was the tokenizers package.
I have to confess I struggled a bit to come up with a solution for this case, and I found that the best approach is to modify (remove the punctuation of) the stopwords internally. I updated the 'textTinyR' package based on your suggestions and checked the output of both the tokenizers and the textTinyR package using a simple sentence as a test case (first install the updated version with devtools::install_github(repo = 'mlampros/textTinyR')):

# I don't have access to your *stopwordslist* to reproduce your output so I used a specific vector of length 3

sentence = "1. Check hanson's suggestions about words such as doesn't and Aren't'. 2. Moreover, see how the function behaves with her corpus, 
which has too many instances of word--word or word---word or low-level or life--forget. Additional words to check are : $27 and ``A"

t = tokenizers::tokenize_words(sentence, 
                               lowercase = TRUE, 
                               stopwords = c("aren't", "doesn't", "a"), 
                               strip_punct = TRUE, 
                               strip_numeric = FALSE, 
                               simplify = FALSE)

t1 = textTinyR::tokenize_transform_vec_docs(object = c(sentence, sentence),                        # minimum number of sentences is 2
                                            as_token = TRUE,
                                            to_lower = TRUE,
                                            remove_punctuation_string = FALSE,
                                            remove_punctuation_vector = TRUE,
                                            remove_numbers = FALSE, 
                                            trim_token = TRUE,
                                            split_string = TRUE,
                                            split_separator = " \r\n\t.,;:()?!//-",                # I added '-'
                                            remove_stopwords = c("aren't", "doesn't", "a"),
                                            stemmer = NULL, # "porter2_stemmer",
                                            threads = 1, 
                                            verbose = FALSE)
t
t1[[1]][[1]]

Please let me know if that works as expected. I don't expect it to give exactly the same results as the tokenizers package, but I do expect the functions to return the appropriate output.

Moreover, I appreciate your suggestions / bug reports, so let me know if you would like me to mention you in the README.md file of the textTinyR package.

@stale

stale bot commented Mar 13, 2019

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

@stale stale bot added the stale label Mar 13, 2019
@stale stale bot closed this as completed Mar 20, 2019