This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Support for other languages #104

Closed
tester88 opened this issue Nov 16, 2015 · 10 comments

Comments

@tester88

Hi,
First of all, thanks for maintaining this.
I just noticed that both token_sort_ratio and token_set_ratio don't support Arabic characters. I don't know about other non-English scripts, but at least they don't support Arabic.
They return 0 when comparing anything with an Arabic string, even when both strings are Arabic.

>>> print fuzz.token_sort_ratio("مرحبا جميعا", "مرحبا جميعا وشكرا لكم")
0
>>> print fuzz.partial_ratio("مرحبا جميعا", "مرحبا جميعا وشكرا لكم")
100

So I'm just wondering: is this a bug, or does it simply not support non-English characters?
Thanks

@josegonzalez
Contributor

I think you can set a custom scorer (the default is fuzz.WRatio) that doesn't force ASCII. By default we pass force_ascii=True in the scorers, which strips non-ASCII characters, so that probably breaks languages like Arabic :(

PR #90 also implements more changes on this front, though it appears the dev stopped working on it. I might try and re-submit the PR myself (and hope the tests pass) later tonight, which should further help solve your issue.

Let me know if this helps.
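
For readers following along, here is a minimal stdlib-only sketch of why a force_ascii=True default would produce a score of 0 for Arabic input. It is not fuzzywuzzy's actual code: `ascii_only` and `token_sort_ratio_sketch` are hypothetical names, and difflib stands in for the library's matcher. The assumption is that the ASCII filter simply drops every non-ASCII character, which leaves nothing but whitespace to compare.

```python
from difflib import SequenceMatcher


def ascii_only(s):
    """Drop every non-ASCII character (an assumption about what force_ascii=True implies)."""
    return "".join(ch for ch in s if ord(ch) < 128)


def token_sort_ratio_sketch(s1, s2, force_ascii=True):
    """Illustrative stand-in for a token-sort scorer; hypothetical, not the library's code."""
    if force_ascii:
        s1, s2 = ascii_only(s1), ascii_only(s2)
    # Lowercase, split on whitespace, sort the tokens, and rejoin them.
    t1 = " ".join(sorted(s1.lower().split()))
    t2 = " ".join(sorted(s2.lower().split()))
    if not t1 or not t2:
        return 0  # nothing left to compare after filtering
    return int(round(100 * SequenceMatcher(None, t1, t2).ratio()))


a = "مرحبا جميعا"
b = "مرحبا جميعا وشكرا لكم"
print(token_sort_ratio_sketch(a, b, force_ascii=True))   # 0: only spaces survive the filter
print(token_sort_ratio_sketch(a, b, force_ascii=False))  # a non-zero score
```

This would also be consistent with partial_ratio returning 100 in the original report, if that function compares the raw strings without the ASCII filter: the first string is a prefix of the second.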

@tester88
Author

Thank you very much, and sorry for my late reply.
That makes sense. I will wait for your submission, just to make sure I'm using a proper implementation.

Thanks again,

@medecau
Contributor

medecau commented Nov 18, 2015

@tester88 you might have some success with a fork I have been working on: https://github.com/medecau/fuzzywuzzy/tree/master

@tester88
Author

Great, I will check it out.
Thanks buddy

@mzeidhassan

mzeidhassan commented Jan 28, 2020

Hi @josegonzalez
Sorry for re-opening.

Any progress on this? It seems that Arabic is still not supported.

If I run the following code:

from fuzzywuzzy import fuzz, process  # fuzz is needed for scorer=fuzz.ratio
process.extract('الأصبع', ['الإصبع', 'الاصبع', 'الأربع'], scorer=fuzz.ratio)

Although the candidates differ in their 'hamza' characters (ء), and "الأربع" even contains a completely different letter that doesn't appear in the others, I still get a score of 83 for all of them:

[('الإصبع', 83), ('الاصبع', 83), ('الأربع', 83)]

@medecau the link is now broken. Can you please double check?

Thanks
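
For context, identical scores of 83 are what a plain character-level ratio would give here. Each candidate has six characters and differs from the query in exactly one position, so a ratio of the form 2 * M / T (M matched characters, T combined length) comes to 2 * 5 / 12 ≈ 0.83. The sketch below reproduces that arithmetic with difflib; `simple_ratio` is a hypothetical helper, and whether fuzz.ratio computes its score in exactly this way depends on the installed backend, so treat this as an illustration rather than the library's implementation.

```python
import unicodedata
from difflib import SequenceMatcher


def simple_ratio(s1, s2):
    """Return 2 * M / T scaled to 0-100 (M matched characters, T combined length)."""
    # NFC keeps hamza-carrying letters as single code points, so lengths are comparable.
    s1 = unicodedata.normalize("NFC", s1)
    s2 = unicodedata.normalize("NFC", s2)
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


query = "الأصبع"
for choice in ["الإصبع", "الاصبع", "الأربع"]:
    # Five of six characters line up in every case: 2 * 5 / 12 ≈ 0.83.
    print(choice, simple_ratio(query, choice))
```

A character-level ratio treats every substitution the same, so swapping one hamza form for another costs exactly as much as swapping in an unrelated letter.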

@josegonzalez
Contributor

There hasn't been any progress. You'd need to write your own custom scorer to handle non-ASCII character sets. It isn't likely that there will be any progress in this repository, as the folks who originally worked on and maintained it have since left the company (myself included).

If you find a solution, feel free to post it here for other users.

@mzeidhassan

Thanks @josegonzalez for getting back to me. I appreciate it. I am sorry to hear that though. Wish you all the best in your new pursuit.

@WorksbyBBS

Have you ever managed to implement this, or found a solution for fuzzy string matching in Arabic (Python)? I’m looking for something similar as well.

@medecau
Contributor

medecau commented Jan 21, 2022

@WorksbyBBS
As said before, if the current functionality does not work for your intended use, you'll need to write your own scorer.
That would probably be very useful for other users attempting to do fuzzy string matching on Arabic text.

Note that the problem being solved is: fuzzy string match. The nature of what it means "to match" is itself fuzzy.

Good luck.
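
As a starting point for such a scorer, here is a minimal stdlib-only sketch. Everything in it is an assumption made for illustration: `normalize_arabic`, `arabic_ratio`, and the `fold_alef` switch are hypothetical names, and the choices it encodes (dropping diacritics and tatweel, optionally treating hamza-carrying alef forms as plain alef) are only one possible reading of what "to match" should mean.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical folding table: hamza-carrying and madda alef forms map to bare alef.
ALEF_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_arabic(s, fold_alef=False):
    """Normalize text before scoring; which steps are appropriate depends on the use case."""
    s = unicodedata.normalize("NFC", s)
    # Remove combining marks (category Mn), which covers the short-vowel diacritics.
    s = "".join(ch for ch in s if unicodedata.category(ch) != "Mn")
    s = s.replace("\u0640", "")  # drop tatweel (elongation character)
    if fold_alef:
        s = s.translate(ALEF_FOLD)
    return " ".join(s.split())  # collapse runs of whitespace


def arabic_ratio(s1, s2, fold_alef=False):
    """Score two strings from 0 to 100 after normalization, without any ASCII filter."""
    n1, n2 = normalize_arabic(s1, fold_alef), normalize_arabic(s2, fold_alef)
    if not n1 or not n2:
        return 0
    return int(round(100 * SequenceMatcher(None, n1, n2).ratio()))


print(arabic_ratio("الأصبع", "الإصبع"))                  # hamza forms kept distinct
print(arabic_ratio("الأصبع", "الإصبع", fold_alef=True))  # hamza forms treated as equal
```

A function with this two-argument shape should be usable wherever a scorer callable is accepted, though that is untested here. Wrapping it with functools.partial would be one way to fix `fold_alef=True`.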

@medecau
Contributor

medecau commented Jan 21, 2022

@mzeidhassan

the link is now broken. Can you please double check?

I no longer support fuzzywuzzy, fuzzywuzzy forks, nor any project started or maintained by seatgeek.
Take a look at RapidFuzz.
