How to deal with Chinese characters? #3

Closed
ctfysh opened this issue Dec 16, 2017 · 5 comments
Comments

@ctfysh

ctfysh commented Dec 16, 2017

library(fuzzywuzzyR)
word = "安广"
choices = c("安徽","安广")
init_proc = FuzzUtils$new()      # initialization of FuzzUtils class to choose a processor
PROC = init_proc$Full_process    # processor-method
init_scor = FuzzMatcher$new()    # initialization of the scorer class
SCOR = init_scor$WRATIO          # chosen scorer function
init <- FuzzExtract$new()        # Initialization of the FuzzExtract class
init$Extract(string = word, sequence_strings = choices, scorer = SCOR)

When I run the code above, all the results are 0. Does that mean this package cannot work properly with Chinese characters?

@mlampros
Owner

mlampros commented Dec 16, 2017

I'm sorry for the late reply.

As you already know, the fuzzywuzzyR package ports the fuzzywuzzy Python library, so I searched the issues page of the Python library and found the following three, which might also be related to Chinese characters:

seatgeek/fuzzywuzzy#20
seatgeek/fuzzywuzzy#104
seatgeek/fuzzywuzzy#82

One would expect the force_ascii parameter to solve your issue (note that it doesn't apply to all functions). For instance,

library(fuzzywuzzyR)
word = "安广"
word1 = "安徽"

init_scor = FuzzMatcher$new()    # initialization of the scorer class

SCOR = init_scor$QRATIO(string1 = word, string2 = word1, force_ascii = FALSE)

However, this returns an error:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position ....

It seems to me that this is a decoding issue: the string is being decoded with the ASCII codec.
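To illustrate the kind of failure in the error message above, here is a minimal Python 3 sketch, independent of fuzzywuzzyR. It assumes the Chinese string reaches Python as UTF-8 encoded bytes; the helper name `decode_or_none` is made up for this example. (In Python 3, `str` is already Unicode and has no `decode` method, so only `bytes` objects are decoded.)

```python
def decode_or_none(raw, codec):
    """Hypothetical helper: decode bytes, or return None if the codec fails."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None

# UTF-8 bytes of the query string; every byte is above 127,
# so none of them is valid ASCII
raw = "安广".encode("utf-8")

print(len(raw))                      # 6 bytes for 2 characters
print(decode_or_none(raw, "ascii"))  # None: the ASCII codec rejects the bytes
print(decode_or_none(raw, "utf-8"))  # 安广
```

Decoding with the matching codec recovers the two characters, which is presumably why an explicit 'utf-8' decoding step helps here.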

What worked for me, after some trial and error, was the following code chunk, which uses the reticulate package directly (reticulate is a dependency of fuzzywuzzyR) and requires rudimentary Python knowledge:

# import the python builtin functions in R

BUILTINS = reticulate::import_builtins(convert = FALSE)

# first convert the chinese characters to a 'python string' and use 'utf-8' decoding

first_word = BUILTINS$str("安广")$decode('utf-8')

second_word= BUILTINS$str("安徽")$decode('utf-8')

third_word = BUILTINS$str("广徽")$decode('utf-8')

fourth_word = BUILTINS$str("")$decode('utf-8')


# import directly the python fuzzywuzzy library in R

fzr = reticulate::import("fuzzywuzzy")


# 'force_ascii' is set to FALSE as character strings are already decoded

fzr$fuzz$QRatio(first_word, second_word, force_ascii = FALSE)     

[1] 50


fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)     

[1] 67


fzr$fuzz$QRatio(fourth_word, fourth_word, force_ascii = FALSE)     

[1] 100

Please let me know if it works (I'm not familiar with the Chinese language).

@ctfysh
Author

ctfysh commented Dec 19, 2017

Thank you very much! This works for me on Ubuntu 16.04.

In fact, each Chinese character can be treated like a letter in English. So if I set "a" = "安" and "b" = "徽", then "ab" = "安徽". To test this hypothesis, I appended the following to your code:

fzr$fuzz$QRatio("ab", "a", force_ascii = FALSE)
[1] 67

and the result is the same as that of

fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)     
[1] 67

So this is very helpful for me. Thank you. Are you planning to add this code to the fuzzywuzzyR package to make it more powerful?
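The "one character behaves like one letter" observation can be checked without fuzzywuzzy at all. The sketch below uses only `difflib.SequenceMatcher` from the Python standard library and a made-up helper `simple_ratio`; it is an approximation of a plain ratio score, not the library's QRatio (which also preprocesses its input). The single-character string "安" is an assumption chosen for illustration.

```python
from difflib import SequenceMatcher

def simple_ratio(s1, s2):
    """Hypothetical helper: similarity as 2 * matches / total length, scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

# Latin letters: one of three characters in total is unmatched
print(simple_ratio("ab", "a"))        # 67

# Chinese characters are compared one code point at a time, same as letters
print(simple_ratio("安广", "安"))      # 67
print(simple_ratio("安广", "安徽"))    # 50
print(simple_ratio("安广", "安广"))    # 100
```

Once the strings are proper Unicode, the comparison only counts matching characters, so the script they come from does not matter.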

@mlampros
Owner

I made a slight change to my previous comment concerning the force_ascii = FALSE parameter, which returns a decoding error (ASCII codec).

I'll try to add this functionality to the package, so I'll leave this issue open until I have some results.

@mlampros
Owner

I added the decoding parameter to the following classes: FuzzExtract, FuzzMatcher and FuzzUtils. The decoding parameter does not apply to the GetCloseMatches and SequenceMatcher classes, because the difflib Python library has no force_ascii parameter.
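For readers wondering why the original example scored 0 everywhere, here is a rough Python 3 sketch of what an ASCII-forcing preprocessing step does to Chinese input. The function names `strip_non_ascii` and `full_process_sketch` are hypothetical, and the steps are an assumption about the general approach (drop non-ASCII, replace non-word characters with spaces, lowercase, trim), not a copy of the library's code.

```python
import re

def strip_non_ascii(text):
    """Hypothetical helper: drop every character outside the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)

def full_process_sketch(text, force_ascii=False):
    """Sketch of a preprocessing step; the exact library behaviour is assumed."""
    if force_ascii:
        text = strip_non_ascii(text)
    # Replace anything that is not a letter, digit or underscore with a space
    text = re.sub(r"\W", " ", text)
    return text.lower().strip()

# With ASCII forced, the Chinese characters vanish and nothing is left to compare
print(repr(full_process_sketch("安广%&$#!", force_ascii=True)))   # ''

# Without it, only the special characters are removed
print(repr(full_process_sketch("安广%&$#!", force_ascii=False)))  # '安广'
```

An empty string after preprocessing leaves a scorer with nothing to match, which would be consistent with the all-zero results reported at the top of this thread.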

Using the initial example,

word = "安广"
choices = c("安徽","安广")


init_proc = fuzzywuzzyR::FuzzUtils$new()

# add some special characters
remove_special_chars = paste0(word, "%&$#!")                                      

print(remove_special_chars)

[1] "安广%&$#!"

# 'utf-8' decoding applies only to 'Full_process' method in the 'FuzzUtils' class
PROC = init_proc$Full_process(string = remove_special_chars, decoding = 'utf-8') 

print(PROC)      # special characters removed

[1] "安广"

# 'utf-8' decoding applies to all methods of the 'FuzzMatcher' class
init_scor = fuzzywuzzyR::FuzzMatcher$new(decoding = 'utf-8')                      

# the 'WRATIO' method normally defaults to 'force_ascii = TRUE'; here this is overridden by the 'utf-8' decoding
SCOR = init_scor$WRATIO                                                          

# 'utf-8' decoding applies to all methods of the 'FuzzExtract' class
init <- fuzzywuzzyR::FuzzExtract$new(decoding = 'utf-8')                          

fzextr = init$Extract(string = word, sequence_strings = choices, scorer = SCOR)

print(fzextr)

[[1]]
[[1]][[1]]
[1] "安广"

[[1]][[2]]
[1] 100


[[2]]
[[2]][[1]]
[1] "安徽"

[[2]][[2]]
[1] 50
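The shape of the result above (each choice paired with a score, best match first) can be sketched in a few lines of Python 3. This is a simplified stand-in, not the package's Extract method: `simple_ratio` and `extract_sketch` are made-up names, and the scorer is a plain `difflib` ratio rather than WRATIO.

```python
from difflib import SequenceMatcher

def simple_ratio(s1, s2):
    """Hypothetical scorer: plain character-level similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

def extract_sketch(query, choices, scorer=simple_ratio, limit=5):
    """Score every choice against the query and return the best ones first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

print(extract_sketch("安广", ["安徽", "安广"]))
# [('安广', 100), ('安徽', 50)]
```

For this pair of two-character strings the simplified scorer happens to give the same numbers as the output above; for longer or multi-word strings a weighted scorer may well differ.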

I uploaded the updated version of the package to GitHub; to install it, use

devtools::install_github(repo = 'mlampros/fuzzywuzzyR')

Would you mind taking a look at the tests I added that are relevant to your case (beginning at line 1748) before I submit the newer version (1.0.2) to CRAN? That way I can be sure I didn't miss anything.

@mlampros
Copy link
Owner

I'm closing this issue for now; feel free to reopen it if you run into any errors or bugs.
