
What does remove_punctuation_vector do exactly? #8

Closed
hanson1005 opened this issue Feb 6, 2019 · 12 comments

@hanson1005

Hi. I am following the code from your vignette, but using my own data:
https://cran.r-project.org/web/packages/textTinyR/vignettes/word_vectors_doc2vec.html

When I run the following commands with remove_punctuation_vector changed from FALSE (as in your vignette) to TRUE, the generated "output_token_single_file.txt" shrinks significantly, from 982,000 KB to 258,000 KB:

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, to_lower = T, remove_punctuation_vector = T, .....)

utl = textTinyR::sparse_term_matrix$new(vector_data = concat, file_data = NULL, document_term_matrix = TRUE)
tm = utl$Term_Matrix(sort_terms = FALSE, to_lower = T, remove_punctuation_vector = T, ....)

Also, when building the word vectors (word2vec) with the following code, the progress output shows that it reads 145M words when the earlier step used "remove_punctuation_vector = F", but only 39M words when it used "remove_punctuation_vector = T":

vecs= fastTextR::skipgram_cbow(input_path = PATH_INPUT, output_path = PATH_OUT, ... )

So, my question is: doesn't remove_punctuation_vector simply remove punctuation from the text? Why does the number of words in the corpus drop to less than a third with this change alone?
What does this option actually remove?

@mlampros
Owner

mlampros commented Feb 6, 2019

hi @hanson1005,

in the documentation of the textTinyR package, and especially for the tokenize_transform_vec_docs function (though not only for that one), I added the following:

'remove_punctuation_string'  either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

'remove_punctuation_vector'  either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

The difference between the two (remove_punctuation_string and remove_punctuation_vector) has to do with the 'split_string' parameter,

'split_string'  either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

but also with the 'split_separator' and 'as_token' parameters.

All these parameters interact depending on the use case. You might want to first split a character string (sentence) and then remove the punctuation; in another case you might want to remove the punctuation without splitting the string at all.
You can experiment with these parameters using the example I've added to the documentation of tokenize_transform_vec_docs (be aware that if you include, for instance, the period ('.') in the 'split_separator', then the string will be split at that character and the character itself will not appear in the output):

library(textTinyR)

token_doc_vec = c("CONVERT to low....er", "remove.. punctuation11234", "trim token and split ")

res = tokenize_transform_vec_docs(object = token_doc_vec, 
                                  as_token = TRUE,
                                  to_lower = TRUE, 
                                  split_string = TRUE,
                                  split_separator = " \r\n\t,;:()?!//", 
                                  remove_punctuation_string = TRUE,          # FALSE
                                  remove_punctuation_vector = FALSE)        # TRUE

res
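
For a side-by-side comparison, here is the same call with the two punctuation flags flipped (this corresponds to the commented-out values above; no other parameters change):

res2 = tokenize_transform_vec_docs(object = token_doc_vec, 
                                   as_token = TRUE,
                                   to_lower = TRUE, 
                                   split_string = TRUE,
                                   split_separator = " \r\n\t,;:()?!//", 
                                   remove_punctuation_string = FALSE,        # TRUE in the call above
                                   remove_punctuation_vector = TRUE)         # FALSE in the call above

res2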

Regarding your second point about the reduction of the output size: it comes down to the number of special characters in the input file, because the remove_punctuation_vector parameter removes all special characters from the input data.
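
If you want to sanity-check this on your own input, a rough way (a sketch, assuming 'concat' is the character vector you pass to the function) is to count the punctuation characters it contains:

# keep only the punctuation characters of each document, then count them
punct_only = gsub("[^[:punct:]]", "", concat)
sum(nchar(punct_only))      # total punctuation characters that remove_punctuation_vector would strip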

@mlampros
Owner

mlampros commented Feb 7, 2019

hi @hanson1005 ,

textTinyR (compared to my other R packages) has many parameters, so I took a second look at this issue to find out whether there is a bug. I ran the textTinyR::tokenize_transform_vec_docs() function with the remove_punctuation_vector parameter set first to FALSE and then to TRUE, and I also saved the vocabulary to a file so I could load it back into R and inspect the character strings along with their counts:

save_dat = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, 
                                                  to_lower = T, 
                                                  remove_punctuation_vector = FALSE,
                                                  remove_numbers = F, trim_token = T, 
                                                  split_string = T, 
                                                  split_separator = " \r\n\t.,;:()?!//",
                                                  remove_stopwords = T, language = "english", 
                                                  min_num_char = 3, max_num_char = 100, 
                                                  stemmer = "porter2_stemmer", 
                                                  path_2folder = "/path_to_your_folder/",
                                                  vocabulary_path_file = "/path_to_your_folder/vocab.txt",
                                                  threads = 6, verbose = T)

punct1 = unlist(save_dat$token)

save_dat2 = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, 
                                                  to_lower = T, 
                                                  remove_punctuation_vector = TRUE,
                                                  remove_numbers = F, trim_token = T, 
                                                  split_string = T, 
                                                  split_separator = " \r\n\t.,;:()?!//",
                                                  remove_stopwords = T, language = "english", 
                                                  min_num_char = 3, max_num_char = 100, 
                                                  stemmer = "porter2_stemmer", 
                                                  path_2folder = "/path_to_your_folder2/",
                                                  vocabulary_path_file = "/path_to_your_folder2/vocab.txt",
                                                  threads = 6, verbose = T)


punct2 = unlist(save_dat2$token)

If you load the two "output_token_single_file.txt" files from the two folders, you can count the number of characters in each one using the 'readLines' and 'nchar' base functions:

fileName <- '/path_to_your_folder/output_token_single_file.txt'
con <- file(fileName,open="r")
line1 <- readLines(con)
close(con)

str(line1)
nchar(line1)

[1] 4964786


fileName <- '/path_to_your_folder2/output_token_single_file.txt'
con <- file(fileName,open="r")
line2 <- readLines(con)
close(con)

str(line2)
nchar(line2)

[1] 4878736

Therefore, with "remove_punctuation_vector" set to FALSE I get 4964786 characters, whereas with it set to TRUE I get 4878736 characters. That is a difference of roughly 2 % ( (1.0 - (4878736 / 4964786)) * 100 ). Also, the first file has a size of 5.0 MB whereas the file from the second folder is 4.9 MB, which seems correct given that the second file does not include special characters.
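
To confirm the sizes programmatically (a quick check with base R, using the same placeholder paths as above):

file.size('/path_to_your_folder/output_token_single_file.txt')  / 1000^2    # size in MB, punctuation kept
file.size('/path_to_your_folder2/output_token_single_file.txt') / 1000^2    # size in MB, punctuation removed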

You should also know that, under the CRAN submission policy, R package authors are not allowed to modify a user's workspace. Previous "output_token_single_file.txt" files are therefore not removed automatically: each time you run one of the functions that saves an "output_token_single_file.txt" to your target folder, you have to delete the previous file first, otherwise the new data is appended to the end of the existing file, inflating its size. Can you please re-run the previous code snippets after first removing any previously saved files?
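
For example, a small sketch of that clean-up step before re-running (adjust the placeholder path):

out_file = '/path_to_your_folder/output_token_single_file.txt'
if (file.exists(out_file)) file.remove(out_file)    # delete the previous output so the new run does not append to it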

Moreover, on which operating system do you use the textTinyR package?

@hanson1005
Author

hanson1005 commented Feb 8, 2019 via email

@mlampros
Owner

mlampros commented Feb 8, 2019

hi @hanson1005,

thanks for making me aware of this issue. It is a bug on the Windows OS caused by the OpenMP parallelization. The issue does not appear on Linux (I haven't tested on Macintosh yet).
I'll upload the updated version to the GitHub repository tomorrow. To receive the same results across multiple runs you must set the 'threads' parameter to 1. For testing purposes you can use the following code snippet, based on the test_text.txt data from the tests folder (you can also adapt it to your own data set). There are two cases to toggle each time you run the for-loop: first, with or without parallelization; second, with or without removal of punctuation. Moreover, on Windows the 'unlink' base R function does not seem to work, so before each run you must delete the previously created folders inside the 'folder_for_all_out' folder. The for-loop counts the KBs and characters of the output files based on the uncommented variables ('THREADS_' and 'remove_punct'):

PATH = 'test_text.txt'

text_rows = textTinyR::read_rows(input_file = PATH)$data

iters = parallel::detectCores()

folder_for_all_out = 'save_folder'          # specify a folder to save the output-files

K_BYTES = CHARS = rep(NA, iters)
lst_vocab_terms_values = list()


for (i in 1:iters) {
  
  #-------------- 1st. case
  THREADS_ = 1
  # THREADS_ = i
  #--------------
  
  #------------------- 2nd. case
  remove_punct = TRUE
  # remove_punct = FALSE
  #-------------------  

  if (.Platform$OS.type == "windows") {
    ext = "\\"
  }
  if (.Platform$OS.type == "unix") {
    ext = "/"
  }

  tmp_out_dir = paste0(file.path(folder_for_all_out, i), ext)
  vocab_file = file.path(folder_for_all_out, i, 'VOCAB.txt')
  
  #---------------------------------------------- on linux
  if (.Platform$OS.type == "unix") {
     if (dir.exists(tmp_out_dir)) {
       unlink(tmp_out_dir, recursive = T)
       dir.create(tmp_out_dir)
     }
     else {
       dir.create(tmp_out_dir)
     }
  }
  #---------------------------------------------- on windows [ delete the folders inside 'folder_for_all_out' each time   you run the loop because 'unlink' does not work on windows ]
  
  if (.Platform$OS.type == "windows") {
    dir.create(tmp_out_dir)
  }
  #---------------------------------------------- 
  
  save_dat = textTinyR::tokenize_transform_vec_docs(object = text_rows, as_token = T, 
                                                    to_lower = T, 
                                                    remove_punctuation_vector = remove_punct,
                                                    remove_numbers = F, trim_token = T, 
                                                    split_string = T, 
                                                    split_separator = " \r\n\t.,;:()?!//",
                                                    remove_stopwords = T, language = "english", 
                                                    min_num_char = 3, max_num_char = 100, 
                                                    stemmer = "porter2_stemmer", 
                                                    path_2folder = tmp_out_dir,
                                                    vocabulary_path_file = vocab_file,
                                                    threads = THREADS_, verbose = T)
  
  tmp_out_file = file.path(folder_for_all_out, i, 'output_token_single_file.txt')
  
  tmp_file = textTinyR::read_characters(input_file = tmp_out_file, characters = 100000)
  tmp_vocab = read.delim(vocab_file, header = F, stringsAsFactors = F)
  # tmp_vocab = tmp_vocab[order(tmp_vocab$V2, decreasing = T), ]
  lst_vocab_terms_values[[i]] = paste(tmp_vocab$V1, tmp_vocab$V2, sep = '_')
  
  K_BYTES[i] = textTinyR::bytes_converter(input_path_file = tmp_out_file, unit = 'KB')
  CHARS[i] = nchar(tmp_file$data, type = "chars")
}


#---------------------------------------
# check that KB's and number of characters are equal
#---------------------------------------

print(K_BYTES)
print(CHARS)

all(K_BYTES == K_BYTES[1])
all(CHARS == CHARS[1])


#----------------------------------
# check vocabulary terms and values
#----------------------------------

all_vocab = rep(NA, length(lst_vocab_terms_values) - 1)

for (j in 2:length(lst_vocab_terms_values)) {
  tmp_all = lst_vocab_terms_values[[1]] %in% lst_vocab_terms_values[[j]]
  # print(tmp_all)
  all_vocab[j-1] = all(tmp_all)
}

print(all_vocab)

I tested it on Ubuntu 18.10 and on Windows 10. On Ubuntu it gives the expected results (although the output text is not in the same order) both with and without parallelization, whereas on Windows 10 it does not.

@hanson1005
Author

hanson1005 commented Feb 8, 2019 via email

@mlampros
Owner

mlampros commented Feb 9, 2019

hi @hanson1005,

regarding the third point you've mentioned: depending on the data set, a single "word" may consist entirely of special characters (punctuation). Once you remove the punctuation from the data, these single words (made up of one or more special characters) are removed as well. I don't have access to your data to say whether that is the case, but you can specify a vocabulary path for the 'tokenize_transform_vec_docs()' function and check for yourself.
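
For example, a rough check (a sketch, assuming you saved a vocabulary via the 'vocabulary_path_file' parameter as in the earlier snippets):

# load the vocabulary (term, count) and list the terms that consist only of punctuation;
# these are the "words" that disappear entirely when remove_punctuation_vector = TRUE
vocab = read.delim('/path_to_your_folder/vocab.txt', header = FALSE, stringsAsFactors = FALSE)
vocab[grepl("^[[:punct:]]+$", vocab$V1), ]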

It would also help if you could add a reproducible example based on a small subset of your data (or fake data) that produces this behaviour, i.e. a case where you expect the number of words generated with and without removing the punctuation to be the same (something along the lines of the skeleton below).
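
A hypothetical skeleton with fake data (swap in a small subset of your own corpus and your stopword list; the parameters mirror the earlier calls):

fake_docs = c("doesn't this look like a well-known low-level example -- with $27 and '' included",
              "aren't hyphens such as life--forget and gulf--who the main issue in this corpus")

count_unique = function(remove_punct) {
  res = textTinyR::tokenize_transform_vec_docs(object = fake_docs, as_token = TRUE,
                                               to_lower = TRUE,
                                               remove_punctuation_vector = remove_punct,
                                               remove_numbers = FALSE, trim_token = TRUE,
                                               split_string = TRUE,
                                               split_separator = " \r\n\t.,;:()?!//",
                                               remove_stopwords = TRUE, language = "english",
                                               threads = 1, verbose = FALSE)
  length(unique(unlist(res$token)))              # number of unique tokens for this setting
}

c(with_punct = count_unique(FALSE), without_punct = count_unique(TRUE))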

I'll update this thread once I upload the new version of the textTinyR package.

@mlampros
Owner

mlampros commented Feb 9, 2019

@hanson1005,

I updated the textTinyR package. You can read about the changes in the NEWS.md file (version 1.1.3). Please test it and let me know (you can install it using devtools::install_github(repo = 'mlampros/textTinyR') ). Thanks again for spotting these bugs.

@stale

stale bot commented Feb 21, 2019

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

@stale stale bot added the stale label Feb 21, 2019
@stale stale bot closed this as completed Feb 28, 2019
@mlampros
Owner

I'll keep this issue open till I submit the new version of the textTinyR package to CRAN.

@mlampros mlampros reopened this Feb 28, 2019
@stale stale bot removed the stale label Feb 28, 2019
@hanson1005
Author

I have further findings regarding the first and second issues I raised earlier in this thread, about the number of characters and the number of unique words with and without removing the punctuation.

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = F,
                                                   remove_numbers = F, trim_token = T,
                                                   split_string = T,
                                                   split_separator = " \r\n\t.,;:()?!//",
                                                   remove_stopwords = stopwordslist,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 1, verbose = T)

unq_w_punc = unique(unlist(clust_vec$token, recursive = F))
length(unq_w_punc)    # 314,967 unique words (its corresponding output_token_single_file is 259,210 KB in size)

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = T,
                                                   remove_numbers = F, trim_token = T,
                                                   split_string = T,
                                                   split_separator = " \r\n\t.,;:()?!//",
                                                   remove_stopwords = stopwordslist,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 1, verbose = T)

unq_wo_punc = unique(unlist(clust_vec$token, recursive = F))
length(unq_wo_punc)   # 274,845 unique words (its corresponding output_token_single_file is 324,376 KB in size)

dropped <- unq_w_punc[!(unq_w_punc %in% unq_wo_punc)]
transformed <- unq_wo_punc[!(unq_wo_punc %in% unq_w_punc)]

dropped[1:30]         # unique words that are dropped after removing the punctuation
 [1] "low-level"       "now-discredit"   "$27"             "evid"            "government-issu" "''"              "catch-22"        "no-risk"
 [9] "win-win"         "a"               "was--how"        "--not"           "life--forget"    "gulf--who"       "well-known"      "veter-an"
[17] "all-encompass"   "short-sight"     "ros-lehtinen"    "that--and"       "o'clock"         "as--excus"       "stand--i"        "explore'"
[25] "is--and"         "treatment"       "``experi"        "`multipl"        "'"               "here--thi"

transformed[1:30]     # unique words that newly emerge (are transformed) after removing the punctuation
 [1] "lowlevel"         "nowdiscredit"    "doesnt"          "governmentissu"  "catch22"         "norisk"          "winwin"
 [8] "washow"           "lifeforget"      "gulfwho"         "wellknown"       "allencompass"    "roslehtinen"     "oclock"
[15] "asexcus"          "standi"          "wouldnt"         "werent"          "wasnt"           "didnt"           "herethi"
[22] "synergisticpeopl" "thingcan"        "sensitivityand"  "drugwhi"         "arent"           "youyou"          "testimonytel"
[29] "wrongwith"        "weve"

What I learn from this is that the reason the clust_vec with punctuation removed has fewer unique words yet more characters than the clust_vec that keeps the punctuation is as follows:

  1. The former removed the words that consist purely of special characters (e.g. "$27", "``a", "'"), so it has fewer unique words.
  2. However, since it transformed "doesn't" into "doesnt" and "aren't" into "arent", a tremendous number of stopwords that should have been removed were not removed. So the problem is that the single quote in these well-known stopwords should NOT be stripped BEFORE the stopwords are removed.
  3. Also, my corpus has many instances of "word--word", "word---word", etc. These hyphens between words would be better substituted with a space. Transforming "well-known" to "wellknown" or "low-level" to "lowlevel" is fine, but turning "life--forget" into "lifeforget" is not; it only generates many infrequent words, which worsens learning on the corpus.

Would you be able to fix these?
For now, before you fix them, I can work around the second point by adding extra stopwords to be removed (e.g. "doesnt") and the third point by doing some clean-up of my corpus, as sketched below. However, I think these issues should really be fixed in the package.
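
For reference, a sketch of those interim workarounds (the extra stopwords are hypothetical examples taken from the 'transformed' list above, and the gsub pre-clean replaces runs of two or more hyphens with a space):

stopwordslist_ext = c(stopwordslist, "doesnt", "arent", "wasnt", "werent", "didnt", "wouldnt")  # apostrophe-stripped stopword forms
concat_clean = gsub("-{2,}", " ", concat)                                                       # "life--forget" -> "life forget"; "well-known" stays intact

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat_clean, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = T,
                                                   remove_numbers = F, trim_token = T,
                                                   split_string = T,
                                                   split_separator = " \r\n\t.,;:()?!//",
                                                   remove_stopwords = stopwordslist_ext,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 1, verbose = T)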

@mlampros
Owner

mlampros commented Mar 1, 2019

Hi @hanson1005 and thanks for your detailed comments / suggestions.

When I wrote the package two years ago I used one or two existing packages to compare the output and the runtime (I tried to reproduce as many cases as possible). One of those was the tokenizers package.
I have to confess I struggled a bit to come up with a solution for this case, and I found that the best approach is to modify (remove the punctuation of) the stopwords internally. I updated the 'textTinyR' package based on your suggestions and checked the output of both the tokenizers and the textTinyR package using a simple sentence as a test case (first install the updated version with devtools::install_github(repo = 'mlampros/textTinyR')):

# I don't have access to your *stopwordslist* to reproduce your output so I used a specific vector of length 3

sentence = "1. Check hanson's suggestions about words such as doesn't and Aren't'. 2. Moreover, see how the function behaves with her corpus, 
which has too many instances of word--word or word---word or low-level or life--forget. Additional words to check are : $27 and ``A"

t = tokenizers::tokenize_words(sentence, 
                               lowercase = TRUE, 
                               stopwords = c("aren't", "doesn't", "a"), 
                               strip_punct = TRUE, 
                               strip_numeric = FALSE, 
                               simplify = FALSE)

t1 = textTinyR::tokenize_transform_vec_docs(object = c(sentence, sentence),                        # minimum number of sentences is 2
                                            as_token = TRUE,
                                            to_lower = TRUE,
                                            remove_punctuation_string = FALSE,
                                            remove_punctuation_vector = TRUE,
                                            remove_numbers = FALSE, 
                                            trim_token = TRUE,
                                            split_string = TRUE,
                                            split_separator = " \r\n\t.,;:()?!//-",                # I added '-'
                                            remove_stopwords = c("aren't", "doesn't", "a"),
                                            stemmer = NULL, # "porter2_stemmer",
                                            threads = 1, 
                                            verbose = FALSE)
t
t1[[1]][[1]]

Please let me know if that works as expected. I don't expect it to give exactly the same results as the tokenizers package, but I do expect the functions to return the appropriate output.

Moreover, I appreciate your suggestions / bug reports, so let me know if you would like me to mention you in the README.md file of the textTinyR package.

@stale

stale bot commented Mar 13, 2019

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

@stale stale bot added the stale label Mar 13, 2019
@stale stale bot closed this as completed Mar 20, 2019