What does remove_punctuation_vector do exactly? #8
Comments
hi @hanson1005, in the documentation of the textTinyR package, and especially for the tokenize_transform_vec_docs function (but not only for this specific function), I added the following:
'remove_punctuation_string': either TRUE or FALSE. If TRUE, the punctuation of the character string will be removed (this applies before the split function).
'remove_punctuation_vector': either TRUE or FALSE. If TRUE, the punctuation of the vector of the character strings will be removed (after the string split has taken place).
The difference between the two (remove_punctuation_string and remove_punctuation_vector) has to do not only with the 'split_string' parameter,
'split_string': either TRUE or FALSE. If TRUE, the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.
but with the 'split_separator' and 'as_token' parameters as well. All these parameters interact depending on the use case. It could be that you first want to split a character string (sentence) and then remove the punctuation; in another case you might want to remove the punctuation without splitting the character string at all. For example:
library(textTinyR)
token_doc_vec = c("CONVERT to low....er", "remove.. punctuation11234", "trim token and split ")
res = tokenize_transform_vec_docs(object = token_doc_vec,
as_token = TRUE,
to_lower = TRUE,
split_string = TRUE,
split_separator = " \r\n\t,;:()?!//",
remove_punctuation_string = TRUE, # FALSE
remove_punctuation_vector = FALSE) # TRUE
res
To your second point about the reduction of the output size: it has to do with the number of special characters in the input file, because the remove_punctuation_vector parameter removes all special characters from the input data.
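If it helps, a rough way to estimate how large that reduction can be is to count the punctuation characters in the input with base R. A minimal sketch, reusing the 'token_doc_vec' example above (for your own data you would apply the same check to the vector passed as 'object'):
n_total = sum(nchar(token_doc_vec))                             # all characters in the input
n_punct = sum(nchar(gsub("[^[:punct:]]", "", token_doc_vec)))   # punctuation / special characters only
round(100 * n_punct / n_total, 2)                               # approximate share that punctuation removal can drop
|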
hi @hanson1005, textTinyR (in comparison to my other R packages) has many parameters, so I took a second look at this issue to find out if there is a bug. I ran the textTinyR::tokenize_transform_vec_docs() function with the parameter remove_punctuation_vector set to either FALSE or TRUE, and I also saved the vocabulary to a file so that I can load it in R and observe the character strings along with their counts:
save_dat = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
to_lower = T,
remove_punctuation_vector = FALSE,
remove_numbers = F, trim_token = T,
split_string = T,
split_separator = " \r\n\t.,;:()?!//",
remove_stopwords = T, language = "english",
min_num_char = 3, max_num_char = 100,
stemmer = "porter2_stemmer",
path_2folder = "/path_to_your_folder/",
vocabulary_path_file = "/path_to_your_folder/vocab.txt",
threads = 6, verbose = T)
punct1 = unlist(save_dat$token)
save_dat2 = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
to_lower = T,
remove_punctuation_vector = TRUE,
remove_numbers = F, trim_token = T,
split_string = T,
split_separator = " \r\n\t.,;:()?!//",
remove_stopwords = T, language = "english",
min_num_char = 3, max_num_char = 100,
stemmer = "porter2_stemmer",
path_2folder = "/path_to_your_folder2/",
vocabulary_path_file = "/path_to_your_folder2/vocab.txt",
threads = 6, verbose = T)
punct2 = unlist(save_dat2$token)
If you load the two "output_token_single_file.txt" files from the two different folders, you can count the number of characters that appear in each one using the 'readLines' and 'nchar' base functions:
fileName <- '/path_to_your_folder/output_token_single_file.txt'
con <- file(fileName,open="r")
line1 <- readLines(con)
close(con)
str(line1)
nchar(line1)
[1] 4964786
fileName <- '/path_to_your_folder2/output_token_single_file.txt'
con <- file(fileName,open="r")
line2 <- readLines(con)
close(con)
str(line2)
nchar(line2)
[1] 4878736
Therefore, by setting "remove_punctuation_vector" to FALSE I receive 4964786 characters, whereas when I set "remove_punctuation_vector" to TRUE I receive 4878736 characters. This is a difference of approx. 2 % ( (1.0 - (4878736 / 4964786) ) * 100 ). Also, the first file has a size of 5.0 MB whereas the file from the second folder has a size of 4.9 MB, which I think is correct considering that the second file does not include special characters. You should also know that, based on the CRAN submission policy, R package authors are not allowed to modify a user's workspace. Previous "output_token_single_file.txt" files are therefore not automatically removed: each time you run one of the functions that saves an "output_token_single_file.txt" to your target folder, you have to remove the previous file, otherwise the new data will be appended to the end of the file, increasing its size. Can you please run the previous code snippets after first removing any previously saved files? Moreover, on which operating system do you use the textTinyR package?
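A small base-R sketch for clearing a previously saved output file before re-running (the path is the same placeholder as above):
out_file = '/path_to_your_folder/output_token_single_file.txt'
if (file.exists(out_file)) file.remove(out_file)   # delete the previous output so a new run does not append to it
|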
I am using it on Windows 10.
The last point you made solved my problem. I removed the existing "output_token_single_file.txt" and created one with and one without punctuation, and both files contained a similar number of words and characters.
However, there are some other dubious things going on that I newly discovered.
First, every time I run the "save_dat = textTinyR::tokenize_transform_vec_docs" code, the resulting "output_token_single_file.txt" has a slightly different number of characters, a different number of words, and a different size in KB. Is there any stochastic feature in this? I assume that there shouldn't be, but there really is. I ran the same code five times, and all five runs produced files of different sizes. Some have the same number of words, but others don't, and all five have a different number of characters. I tried set.seed() to see if that would resolve the randomness, but it didn't.
Second, following your code that counts the number of characters, the "output_token_single_file.txt" generated WITHOUT punctuation had MORE characters than the one WITH punctuation. This is the second weird thing that I don't understand.
Third, in terms of the number of words (which is printed when running the "fastTextR::skipgram_cbow" command), the one WITHOUT punctuation contains FEWER words than the one WITH punctuation. Why does the number of words change when I remove only the punctuation?
|
hi @hanson1005, thanks for making me aware of this issue. It's a bug related to the Windows OS due to the OpenMP parallelization. This issue does not appear on Linux (I haven't tested it on Macintosh yet). I'll upload the updated version on the Github repository. To receive the same results for multiple runs you must set the 'threads' parameter to 1. For testing purposes you can use the following code snippet based on the test_text.txt data of the tests folder (you can also adjust it to your own data set). There are two cases: 1st. with or without parallelization, 2nd. with or without removal of punctuation (adjust it every time you run the for loop). Moreover, on Windows the 'unlink' base R function does not seem to work, so at each run you must delete the previously created folders inside the 'folder_for_all_out' folder. This for-loop counts the KB's and characters of the output files based on the uncommented variables ('THREADS_' and 'remove_punct'):
PATH = 'test_text.txt'
text_rows = textTinyR::read_rows(input_file = PATH)$data
iters = parallel::detectCores()
folder_for_all_out = 'save_folder' # specify a folder to save the output-files
K_BYTES = CHARS = rep(NA, iters)
lst_vocab_terms_values = list()
for (i in 1:iters) {
#-------------- 1st. case
THREADS_ = 1
# THREADS_ = i
#--------------
#------------------- 2nd. case
remove_punct = TRUE
# remove_punct = FALSE
#-------------------
if (.Platform$OS.type == "windows") {
ext = "\\"
}
if (.Platform$OS.type == "unix") {
ext = "/"
}
tmp_out_dir = paste0(file.path(folder_for_all_out, i), ext)
vocab_file = file.path(folder_for_all_out, i, 'VOCAB.txt')
#---------------------------------------------- on linux
if (.Platform$OS.type == "unix") {
if (dir.exists(tmp_out_dir)) {
unlink(tmp_out_dir, recursive = T)
dir.create(tmp_out_dir)
}
else {
dir.create(tmp_out_dir)
}
}
#---------------------------------------------- on windows [ delete the folders inside 'folder_for_all_out' each time you run the loop because 'unlink' does not work on windows ]
if (.Platform$OS.type == "windows") {
dir.create(tmp_out_dir)
}
#----------------------------------------------
save_dat = textTinyR::tokenize_transform_vec_docs(object = text_rows, as_token = T,
to_lower = T,
remove_punctuation_vector = remove_punct,
remove_numbers = F, trim_token = T,
split_string = T,
split_separator = " \r\n\t.,;:()?!//",
remove_stopwords = T, language = "english",
min_num_char = 3, max_num_char = 100,
stemmer = "porter2_stemmer",
path_2folder = tmp_out_dir,
vocabulary_path_file = vocab_file,
threads = THREADS_, verbose = T)
tmp_out_file = file.path(folder_for_all_out, i, 'output_token_single_file.txt')
tmp_file = textTinyR::read_characters(input_file = tmp_out_file, characters = 100000)
tmp_vocab = read.delim(vocab_file, header = F, stringsAsFactors = F)
# tmp_vocab = tmp_vocab[order(tmp_vocab$V2, decreasing = T), ]
lst_vocab_terms_values[[i]] = paste(tmp_vocab$V1, tmp_vocab$V2, sep = '_')
K_BYTES[i] = textTinyR::bytes_converter(input_path_file = tmp_out_file, unit = 'KB')
CHARS[i] = nchar(tmp_file$data, type = "chars")
}
#---------------------------------------
# check if KB's and number of characters match across runs
#---------------------------------------
print(K_BYTES)
print(CHARS)
all(K_BYTES == K_BYTES[1])
all(CHARS == CHARS[1])
#----------------------------------
# check vocabulary terms and values
#----------------------------------
all_vocab = rep(NA, length(lst_vocab_terms_values) - 1)
for (j in 2:length(lst_vocab_terms_values)) {
tmp_all = lst_vocab_terms_values[[1]] %in% lst_vocab_terms_values[[j]]
# print(tmp_all)
all_vocab[j-1] = all(tmp_all)
}
print(all_vocab)
I tested it on Ubuntu 18.10 and on Windows 10. On Ubuntu it gives the expected results (although the output text is not in the same order) both with and without parallelization, whereas on Windows 10 it does not. |
Thank you for the quick response.
Regarding the third point I made previously, I think the number of words (but not characters) generated with or without removing punctuation should be the same, but it wasn't.
On this issue, did you mean that in your trials Windows 10 did NOT generate the same number of characters/words with and without punctuation, just as I observed? Is there a solution for that?
|
hi @hanson1005, regarding the third point you've mentioned: you should know that, depending on the data set, it might be the case that a single word consists entirely of one or more special characters (punctuation). Therefore, once you remove the punctuation from the data, these single words will also be removed. I don't have access to your data to say whether that is the case or not, but you can specify a vocabulary path for the 'tokenize_transform_vec_docs()' function and observe on your own whether this happens. It would also be good if you added a reproducible example, for a small subset of your data (or fake data), that produces this behaviour (the number of words generated with or without removing punctuation should be the same); that is, you expect the number of words before and after execution of the function to be equal. I'll update this thread once I upload the new version of the textTinyR package.
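A minimal sketch for that vocabulary check, assuming the vocabulary was saved to a file with 'vocabulary_path_file' as in the earlier snippets (the path is a placeholder):
vocab = read.delim('/path_to_your_folder/vocab.txt', header = FALSE, stringsAsFactors = FALSE)
punct_only = grepl("^[[:punct:]]+$", vocab$V1)     # tokens that consist solely of special characters
sum(punct_only)                                    # how many such tokens the vocabulary contains
head(vocab[punct_only, ])                          # inspect them together with their counts
|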
I updated the textTinyR package. You can read about the changes in the NEWS.md file (version 1.1.3). Please test it and let me know (you can install it using devtools::install_github(repo = 'mlampros/textTinyR') ). Thanks again for spotting these bugs. |
This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond. |
I'll keep this issue open till I submit the new version of the textTinyR package to CRAN. |
I have further findings with regard to the first and second issues that I raised in my second posting on this thread, concerning the number of characters and the number of unique words with and without removing the punctuation vector.
clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, .....)
unq_w_punc = unique(unlist(clust_vec$token, recursive = F))
clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, .....)
unq_wo_punc = unique(unlist(clust_vec$token, recursive = F))
dropped <- unq_w_punc[!(unq_w_punc %in% unq_wo_punc)]
What I learn from here is that the clust_vec with punctuation removed has fewer unique words yet more characters compared to the clust_vec containing punctuation.
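One quick way to see what kinds of tokens were dropped, reusing the 'dropped' vector from the snippet above (a minimal sketch):
head(dropped, 20)                      # a first look at the dropped unique words
table(grepl("[[:punct:]]", dropped))   # how many of them contain special characters
summary(nchar(dropped))                # their length distribution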
Would you be able to fix these? |
Hi @hanson1005, and thanks for your detailed comments / suggestions. When I wrote the package 2 years ago I used 1 or 2 existing packages to compare the output and the runtime (I tried to reproduce as many cases as possible). One of those was the tokenizers package.
# I don't have access to your *stopwordslist* to reproduce your output, so I used a specific vector of length 3
sentence = "1. Check hanson's suggestions about words such as doesn't and Aren't'. 2. Moreover, see how the function behaves with her corpus,
which has too many instances of word--word or word---word or low-level or life--forget. Additional words to check are : $27 and ``A"
t = tokenizers::tokenize_words(sentence,
lowercase = TRUE,
stopwords = c("aren't", "doesn't", "a"),
strip_punct = TRUE,
strip_numeric = FALSE,
simplify = FALSE)
t1 = textTinyR::tokenize_transform_vec_docs(object = c(sentence, sentence), # minimum number of sentences is 2
as_token = TRUE,
to_lower = TRUE,
remove_punctuation_string = FALSE,
remove_punctuation_vector = TRUE,
remove_numbers = FALSE,
trim_token = TRUE,
split_string = TRUE,
split_separator = " \r\n\t.,;:()?!//-", # I added '-'
remove_stopwords = c("aren't", "doesn't", "a"),
stemmer = NULL, # "porter2_stemmer",
threads = 1,
verbose = FALSE)
t
t1[[1]][[1]]
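A quick way to compare the two results programmatically (a sketch reusing the 't' and 't1' objects from the snippet above):
tokens_tok  = unlist(t)            # tokens returned by the 'tokenizers' package
tokens_tiny = t1[[1]][[1]]         # tokens returned by textTinyR for the first document
setdiff(tokens_tok, tokens_tiny)   # tokens that only 'tokenizers' produces
setdiff(tokens_tiny, tokens_tok)   # tokens that only textTinyR produces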
Please let me know if that works as expected. I don't expect it to give the exact same results as the tokenizers package, but I do expect the functions to return the appropriate output. Moreover, I appreciate your suggestions / bug-fixes, so let me know if you would like me to mention you in the README.md file of the textTinyR package. |
This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond. |
Hi. I am following the code you provided at the link below, but using my own data:
https://cran.r-project.org/web/packages/textTinyR/vignettes/word_vectors_doc2vec.html
When the following commands are executed, if I change remove_punctuation_vector from FALSE to TRUE (in your vignette it was FALSE), then the generated "output_token_single_file.txt" shrinks significantly in size, from 982,000 KB to 258,000 KB:
clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T, to_lower = T, remove_punctuation_vector = T, .....)
utl = textTinyR::sparse_term_matrix$new(vector_data = concat, file_data = NULL, document_term_matrix = TRUE)
tm = utl$Term_Matrix(sort_terms = FALSE, to_lower = T, remove_punctuation_vector = T, ....)
Also, when building word2vec by running the following code, the progress output shows that it reads 145M words after setting "remove_punctuation_vector = F", while it reads only 39M words after setting "remove_punctuation_vector = T", in an earlier phase:
vecs= fastTextR::skipgram_cbow(input_path = PATH_INPUT, output_path = PATH_OUT, ... )
So, my question is: doesn't remove_punctuation_vector simply remove punctuation from the text? Why does the number of words in the corpus drop to less than a third with only this change?
What does this option really remove?
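One way to double-check those word counts independently of the fastTextR progress output is to count whitespace-separated tokens in the two output files with base R (a sketch; the file paths are placeholders for the runs with and without punctuation removal):
count_words = function(path) length(scan(path, what = character(), quiet = TRUE))
count_words('/path_to_output_folder_1/output_token_single_file.txt')   # run with remove_punctuation_vector = F
count_words('/path_to_output_folder_2/output_token_single_file.txt')   # run with remove_punctuation_vector = T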