How to deal with Chinese characters? #3

ctfysh · 2017-12-16T12:34:09Z

library(fuzzywuzzyR)
word = "安广"
choices = c("安徽","安广")
init_proc = FuzzUtils$new()      # initialization of FuzzUtils class to choose a processor
PROC = init_proc$Full_process    # processor-method
init_scor = FuzzMatcher$new()    # initialization of the scorer class
SCOR = init_scor$WRATIO          # chosen scorer function
init <- FuzzExtract$new()        # Initialization of the FuzzExtract class
init$Extract(string = word, sequence_strings = choices, scorer = SCOR)

When I run the code above, all the resulting scores are 0. Does this mean that the package cannot work properly with Chinese characters?

mlampros · 2017-12-16T19:31:20Z

I'm sorry for the late reply.

As you already know, the fuzzywuzzyR package ports the fuzzywuzzy Python library, so I searched the issues page of the Python library and found the following three issues, which might also be related to Chinese characters:

seatgeek/fuzzywuzzy#20
seatgeek/fuzzywuzzy#104
seatgeek/fuzzywuzzy#82

The solution to your issue would be to use the force_ascii parameter (note that it doesn't apply to all functions). For instance,

library(fuzzywuzzyR)
word = "安广"
word1 = "安徽"

init_scor = FuzzMatcher$new()    # initialization of the scorer class

SCOR = init_scor$QRATIO(string1 = word, string2 = word1, force_ascii = FALSE)

However, this returns an error:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position ....

It seems to me that it's a decoding issue (ascii encoding).
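The decoding problem can be reproduced outside R. The following is a minimal Python 3 sketch (the session in this thread is presumably Python 2, where `str` is a byte string). It makes no claim about fuzzywuzzy internals. It only shows that the UTF-8 bytes of these characters all lie outside the ASCII range, so an ASCII decode fails while an explicit UTF-8 decode recovers the text. The helper name `try_ascii_decode` is hypothetical.

```python
def try_ascii_decode(raw: bytes) -> str:
    """Return the decoded text, or a short error description if ASCII decoding fails."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: " + err.reason


word = "安广"
raw = word.encode("utf-8")  # the byte string that an undecoded Python 2 'str' would hold

# Every byte of a UTF-8 encoded CJK character is >= 0x80, i.e. non-ASCII.
all_non_ascii = all(byte >= 0x80 for byte in raw)

ascii_result = try_ascii_decode(raw)  # fails: no byte is valid ASCII
utf8_result = raw.decode("utf-8")     # succeeds: yields the original text

print(len(raw), all_non_ascii)
print(ascii_result)
print(utf8_result == word)
```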

What worked for me, after some trial and error, was the following code chunk. It uses the reticulate package directly (reticulate is a dependency of the fuzzywuzzyR package) and requires rudimentary Python knowledge:

# import the python builtin functions in R

BUILTINS = reticulate::import_builtins(convert = FALSE)

# first convert the chinese characters to a 'python string' and use 'utf-8' decoding

first_word = BUILTINS$str("安广")$decode('utf-8')

second_word= BUILTINS$str("安徽")$decode('utf-8')

third_word = BUILTINS$str("广徽")$decode('utf-8')

fourth_word = BUILTINS$str("安")$decode('utf-8')


# import directly the python fuzzywuzzy library in R

fzr = reticulate::import("fuzzywuzzy")


# 'force_ascii' is set to FALSE as character strings are already decoded

fzr$fuzz$QRatio(first_word, second_word, force_ascii = FALSE)     

[1] 50


fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)     

[1] 67


fzr$fuzz$QRatio(fourth_word, fourth_word, force_ascii = FALSE)     

[1] 100

Please let me know if it works (I'm not familiar with the Chinese language).

ctfysh · 2017-12-19T05:06:39Z

Thank you very much! This works for me on Ubuntu 16.04.

In fact, each Chinese character can be treated like a letter in English. So if I set "a" = "安" and "b" = "徽", then we can say "ab" = "安徽". To test this hypothesis, I appended the following code to yours:

fzr$fuzz$QRatio("ab", "a", force_ascii = FALSE)
[1] 67

and the result is the same as

fzr$fuzz$QRatio(first_word, fourth_word, force_ascii = FALSE)     
[1] 67

So this is very helpful for me. Thank you. Are you planning to add this code to the fuzzywuzzyR package to make it more powerful?
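The one-character-equals-one-letter observation can also be checked without fuzzywuzzy. The sketch below assumes that, for short strings like these, the score behaves like the `difflib.SequenceMatcher` ratio scaled to 0 to 100 and rounded. `simple_ratio` is a hypothetical helper, not a fuzzywuzzy function, and it skips the preprocessing that QRatio applies.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two Unicode strings as a rounded 0-100 score."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Latin letters standing in for the Chinese characters: "a" = "安", "b" = "徽"
latin_score = simple_ratio("ab", "a")
chinese_score = simple_ratio("安徽", "安")

print(latin_score, chinese_score)
print(simple_ratio("安广", "安徽"))
print(simple_ratio("安", "安"))
```

Under this assumption the sketch reproduces the scores reported in this thread (67, 50 and 100), because the matcher compares code points and does not care which script they come from.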

mlampros · 2017-12-19T09:01:35Z

I made a slight change to my previous comment regarding the force_ascii = FALSE parameter, which returns a decoding error (ascii encoding).

I'll try to add this functionality to the package, so I'll leave this issue open until I have some results.

mlampros · 2017-12-19T18:56:23Z

I added the decoding parameter to the following classes: FuzzExtract, FuzzMatcher and FuzzUtils. The decoding parameter does not apply to the GetCloseMatches and SequenceMatcher classes, because the difflib Python library has no force_ascii parameter.
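For the difflib-based functionality, no ASCII handling is involved when the strings are already Unicode. This is a minimal sketch, assuming Python 3 (where `str` is Unicode) and calling `difflib` directly rather than through the fuzzywuzzyR wrapper classes:

```python
from difflib import SequenceMatcher, get_close_matches

word = "安广"
choices = ["安徽", "安广"]

# cutoff=0.5 keeps candidates whose similarity ratio is at least 0.5;
# results are returned best match first.
matches = get_close_matches(word, choices, n=2, cutoff=0.5)

# Ratio between the word and the first choice (one of two characters shared).
ratio = SequenceMatcher(None, word, choices[0]).ratio()

print(matches)
print(ratio)
```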

Using the initial example,

word = "安广"
choices = c("安徽","安广")


init_proc = fuzzywuzzyR::FuzzUtils$new()

# add some special characters
remove_special_chars = paste0(word, "%&$#!")                                      

print(remove_special_chars)

[1] "安广%&$#!"

# 'utf-8' decoding applies only to 'Full_process' method in the 'FuzzUtils' class
PROC = init_proc$Full_process(string = remove_special_chars, decoding = 'utf-8') 

print(PROC)      # special characters removed

[1] "安广"

# 'utf-8' decoding applies to all methods of the 'FuzzMatcher' class
init_scor = fuzzywuzzyR::FuzzMatcher$new(decoding = 'utf-8')                      

# the 'WRATIO' method is normally initialized with 'force_ascii = TRUE'; here this is overridden by the 'utf-8' decoding
SCOR = init_scor$WRATIO                                                          

# 'utf-8' decoding applies to all methods of the 'FuzzExtract' class
init <- fuzzywuzzyR::FuzzExtract$new(decoding = 'utf-8')                          

fzextr = init$Extract(string = word, sequence_strings = choices, scorer = SCOR)

print(fzextr)

[[1]]
[[1]][[1]]
[1] "安广"

[[1]][[2]]
[1] 100


[[2]]
[[2]][[1]]
[1] "安徽"

[[2]][[2]]
[1] 50

I uploaded the updated version of the package to GitHub. To install it, use

devtools::install_github(repo = 'mlampros/fuzzywuzzyR')

Would you mind taking a look at the tests I added that are relevant to your case (beginning from line 1748) before I submit the newer version (1.0.2) to CRAN? That way I can be sure I didn't miss anything.

mlampros · 2017-12-22T12:52:10Z

I'm closing this issue for now. Feel free to reopen it if you run into any errors or bugs.

mlampros closed this as completed Dec 22, 2017
