How to deal with Chinese characters? #3
I'm sorry for the late reply. As you already know, the fuzzywuzzyR package ports the fuzzywuzzy Python library, so I searched the issues page of the Python library and found the following three, which might also be related to Chinese characters: seatgeek/fuzzywuzzy#20
The solution to your issue would be to use the force_ascii parameter (note that it doesn't apply to all functions). For instance,
library(fuzzywuzzyR)
word = "安广"
word1 = "安徽"
init_scor = FuzzMatcher$new() # initialization of the scorer class
SCOR = init_scor$QRATIO(string1 = word, string2 = word1, force_ascii = FALSE)
which returns an error,
Error in py_call_impl(callable, dots$args, dots$keywords) :
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position ....
It seems to me that it's a decoding issue (ascii encoding). What worked for me after trial and error was the following code chunk, which uses the reticulate package directly (note that the reticulate package is a dependency of the fuzzywuzzyR package) and requires rudimentary Python knowledge,
# import the python builtin functions in R
BUILTINS = reticulate::import_builtins(convert = FALSE)
# first convert the chinese characters to a 'python string' and use 'utf-8' decoding
first_word = BUILTINS$str("安广")$decode('utf-8')
second_word= BUILTINS$str("安徽")$decode('utf-8')
third_word = BUILTINS$str("广徽")$decode('utf-8')
fourth_word = BUILTINS$str("安")$decode('utf-8')
# import directly the python fuzzywuzzy library in R
fzr = reticulate::import("fuzzywuzzy")
# 'force_ascii' is set to FALSE as character strings are already decoded
fzr$fuzz$QRatio(first_word, second_word, force_ascii = FALSE)
[1] 50
fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)
[1] 67
fzr$fuzz$QRatio(fourth_word, fourth_word, force_ascii = FALSE)
[1] 100
Please let me know if it works (I'm not familiar with the Chinese language).
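One possible reading of the UnicodeDecodeError above, as a plain Python 3 sketch: assuming the failure comes from UTF-8 bytes being decoded with the ascii codec (which Python 2 does implicitly for byte strings), the same bytes decode cleanly once 'utf-8' is given explicitly. The variable names are illustrative and are not part of fuzzywuzzy or fuzzywuzzyR.

```python
# UTF-8 bytes of the two-character string (3 bytes per character here)
raw = "安广".encode("utf-8")

# Decoding with the ascii codec fails, because every byte is >= 0x80
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print("ascii decoding failed:", err.reason)

# Decoding with the codec the bytes were written in recovers the text
text = raw.decode("utf-8")
print(len(raw), len(text))
```

The byte length and the character length differ, which is why a matcher has to work on decoded text rather than raw bytes if each character is to count once.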
Thank you very much! This works for me on Ubuntu 16.04. In fact, each Chinese character can be treated like a letter in English. So if I run
fzr$fuzz$QRatio("ab", "a", force_ascii = FALSE)
[1] 67
the result is the same as with
fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)
[1] 67
So this is very helpful for me. Thank you. Are you planning to put this code into the package?
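The observation that each Chinese character behaves like a single letter can be checked without fuzzywuzzy. The sketch below assumes the score is a rounded 100 × difflib.SequenceMatcher ratio (2 × matching characters / total characters). simple_ratio is a hypothetical helper, not a fuzzywuzzy function, and QRatio additionally preprocesses its inputs.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Hypothetical helper: SequenceMatcher ratio scaled to 0-100 and rounded."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# One shared character out of three in total: 2 * 1 / 3
print(simple_ratio("安广", "安"))
print(simple_ratio("ab", "a"))

# One shared character out of four in total: 2 * 1 / 4
print(simple_ratio("安广", "安徽"))
```

Under that assumption the Chinese and Latin pairs score identically, which agrees with the 67 and 50 reported in this thread.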
I made a slight change to my previous comment concerning the force_ascii = FALSE parameter, which returns a decoding error (ascii encoding). I'll try to add this functionality to the package, so I'll leave this issue open until I have some results.
I added the decoding parameter to the following classes: FuzzExtract, FuzzMatcher and FuzzUtils. The decoding parameter does not apply to the GetCloseMatches and SequenceMatcher classes, because there isn't any force_ascii parameter in the difflib Python library. Using the initial example,
word = "安广"
choices = c("安徽","安广")
init_proc = fuzzywuzzyR::FuzzUtils$new()
# add some special characters
remove_special_chars = paste0(word, "%&$#!")
print(remove_special_chars)
[1] "安广%&$#!"
# 'utf-8' decoding applies only to 'Full_process' method in the 'FuzzUtils' class
PROC = init_proc$Full_process(string = remove_special_chars, decoding = 'utf-8')
print(PROC) # special characters removed
[1] "安广"
# 'utf-8' decoding applies to all methods of the 'FuzzMatcher' class
init_scor = fuzzywuzzyR::FuzzMatcher$new(decoding = 'utf-8')
# the 'WRATIO' method is normally initialized with 'force_ascii = TRUE'; here that is overridden by the 'utf-8' decoding
SCOR = init_scor$WRATIO
# 'utf-8' decoding applies to all methods of the 'FuzzExtract' class
init <- fuzzywuzzyR::FuzzExtract$new(decoding = 'utf-8')
fzextr = init$Extract(string = word, sequence_strings = choices, scorer = SCOR)
print(fzextr)
[[1]]
[[1]][[1]]
[1] "安广"
[[1]][[2]]
[1] 100
[[2]]
[[2]][[1]]
[1] "安徽"
[[2]][[2]]
[1] 50
I uploaded the updated version of the package to GitHub. To install it, use
devtools::install_github(repo = 'mlampros/fuzzywuzzyR')
Would you mind taking a look at the tests I added that are relevant to your case (beginning from line 1748) before I submit the newer version (1.0.2) to CRAN, so that I can be sure I didn't miss something?
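Regarding the difflib-based classes mentioned above: in Python 3, where strings are Unicode by default, difflib compares Chinese strings directly with no decoding step. A minimal sketch using only the standard library (the cutoff value is chosen purely for illustration):

```python
import difflib

word = "安广"
choices = ["安徽", "安广"]

# get_close_matches keeps candidates scoring at least `cutoff`, best match first
matches = difflib.get_close_matches(word, choices, n=2, cutoff=0.4)
print(matches)
```

This only illustrates difflib itself. How the GetCloseMatches and SequenceMatcher classes in fuzzywuzzyR behave depends on the Python version that reticulate is bound to.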
I'm closing this issue for now; feel free to reopen it in case of any errors or bugs.
When I use the above code, I find that the results are all 0. Does that mean that this package cannot work properly when dealing with Chinese characters?