Support for other languages #104

tester88 · 2015-11-16T19:43:44Z

Hi,
First of all, thanks for maintaining this.
I just noticed that both token_sort_ratio and token_set_ratio don't support Arabic characters. I don't know about other non-English scripts, but at least Arabic is not supported.
Comparing anything with an Arabic string returns 0, even when both strings are Arabic.

>>> print fuzz.token_sort_ratio("مرحبا جميعا", "مرحبا جميعا وشكرا لكم")
0
>>> print fuzz.partial_ratio("مرحبا جميعا", "مرحبا جميعا وشكرا لكم")
100

So I'm just wondering whether this is a bug or whether the library simply doesn't support non-English characters?
Thanks

josegonzalez · 2015-11-16T20:15:55Z

I think you can set a custom scorer (the default is fuzz.WRatio) that handles non-ASCII text. By default we pass force_ascii=True in the scorers, so that probably breaks languages like Arabic :(

PR #90 also implements more changes on this front, though it appears the dev stopped working on it. I might try and re-submit the PR myself (and hope the tests pass) later tonight, which should further help solve your issue.

Let me know if this helps.
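
To make the effect of force_ascii concrete, here is a minimal standard-library sketch of what probably happens to the strings from the report. It is not fuzzywuzzy's actual code: `preprocess` and `token_sort_score` are hypothetical stand-ins written for illustration, and `difflib.SequenceMatcher` replaces the library's own ratio, so absolute scores may differ from what the library returns.

```python
import re
from difflib import SequenceMatcher


def preprocess(s, force_ascii=True):
    """Hypothetical stand-in for the library's preprocessing step."""
    if force_ascii:
        # Drop every character outside the 7-bit ASCII range.
        s = "".join(ch for ch in s if ord(ch) < 128)
    # Replace anything that is not a word character
    # (letter, digit, underscore) with a space.
    s = re.sub(r"\W", " ", s)
    return s.lower().strip()


def token_sort_score(s1, s2, force_ascii=True):
    """Sort the whitespace-separated tokens, then compare the joined strings."""
    a = " ".join(sorted(preprocess(s1, force_ascii).split()))
    b = " ".join(sorted(preprocess(s2, force_ascii).split()))
    if not a or not b:
        # Nothing left to compare after preprocessing.
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


short = "مرحبا جميعا"
longer = "مرحبا جميعا وشكرا لكم"

# With ASCII forced, every Arabic letter is stripped and both inputs become empty.
print(repr(preprocess(short)))          # ''
print(token_sort_score(short, longer))  # 0

# With ASCII forcing switched off, the letters survive and a real comparison happens.
print(token_sort_score(short, longer, force_ascii=False))
```

If the installed version of the library accepts a force_ascii argument on its scorers, passing force_ascii=False would be the analogous switch; that is worth checking against the version in use.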

tester88 · 2015-11-18T15:59:18Z

Thank you very much and sorry for my late reply,
That makes sense. I will wait for your submission, just to make sure I'm using a proper implementation.

Thanks again,

medecau · 2015-11-18T19:01:52Z

@tester88 you might have some success with a fork I have been working on: https://github.com/medecau/fuzzywuzzy/tree/master

tester88 · 2015-11-19T13:01:54Z

Great, I will check it out.
Thanks buddy

mzeidhassan · 2020-01-28T22:19:29Z

Hi @josegonzalez
Sorry for re-opening.

Any progress on this? It seems that Arabic is still not supported.

If I run the following code:

from fuzzywuzzy import fuzz, process
process.extract('الأصبع', ['الإصبع', 'الاصبع', 'الأربع'], scorer=fuzz.ratio)

The first two candidates differ from the query only in the 'hamza' character (ء), while the third, "الأربع", contains a completely different letter that appears in none of the others. Still, I get a score of 83 for all of them.

[('الإصبع', 83), ('الاصبع', 83), ('الأربع', 83)]

@medecau the link is now broken. Can you please double-check?

Thanks

josegonzalez · 2020-01-28T22:33:45Z

There hasn't been any progress. You'd need to write your own custom scorer to handle non-ASCII character sets. It isn't likely that there will be any progress in this repository, as the folks who originally worked on and maintained it have since left the company (myself included).

If you find a solution, feel free to post it here for other users.
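
A custom scorer along the lines suggested here could look roughly like the sketch below. It assumes that, for this use case, the alef forms carrying a hamza or madda (أ, إ, آ) should count as equal to bare alef (ا); that is a design choice, not something this thread settles. `normalize_arabic`, `arabic_ratio`, and `rank` are hypothetical names, and `difflib.SequenceMatcher` stands in for the library's ratio.

```python
import unicodedata
from difflib import SequenceMatcher

# Assumption: fold alef with hamza above (U+0623), alef with hamza below (U+0625)
# and alef with madda (U+0622) into bare alef (U+0627).
ALEF_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_arabic(s):
    """Compose decomposed sequences first, then fold the alef variants."""
    return unicodedata.normalize("NFC", s).translate(ALEF_FOLD).strip()


def arabic_ratio(s1, s2):
    """Character-level similarity (0-100) on the normalized strings."""
    a, b = normalize_arabic(s1), normalize_arabic(s2)
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def rank(query, choices, scorer=arabic_ratio):
    """Score every choice against the query, best match first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(rank("الأصبع", ["الإصبع", "الاصبع", "الأربع"]))
```

Under this folding, the two hamza variants become identical to the query and score 100, while "الأربع" still differs by one letter and scores lower. Without the folding step, each candidate differs from the query in exactly one of six characters, which would explain why a plain character-level ratio gives all three the same score. A function with this two-string signature should be usable as the scorer= argument of process.extract, though how the library's own preprocessing interacts with it would need checking.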

mzeidhassan · 2020-01-29T15:52:56Z

Thanks @josegonzalez for getting back to me. I appreciate it. I am sorry to hear that though. Wish you all the best in your new pursuit.

WorksbyBBS · 2022-01-20T06:30:52Z

Have you ever managed to implement this, or found a solution for fuzzy string matching on Arabic text in Python? I'm looking for something similar as well.

medecau · 2022-01-21T08:18:43Z

@WorksbyBBS
As said before, if the current functionality does not work for your intended use, you'll need to write your own scorer.
That would probably be very useful for other users attempting fuzzy string matching on Arabic text.

Note that the problem being solved is fuzzy string matching. The nature of what it means "to match" is itself fuzzy.

Good luck.

medecau · 2022-01-21T08:22:53Z

@mzeidhassan

> the link is now broken. Can you please double check?

I no longer support fuzzywuzzy, fuzzywuzzy forks, nor any project started or maintained by seatgeek.
Take a look at RapidFuzz.

josegonzalez closed this as completed Jul 22, 2016

mlampros mentioned this issue Dec 16, 2017

How to deal with Chinese characters? mlampros/fuzzywuzzyR#3

Closed
