-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for rapidfuzz #77
Conversation
@orsinium I am not quite sure how to fix the build errors, since I am not very familiar with drone CI + alpine linux. Currently numpy + rapidfuzz do not have musllinux wheels, so it needs to build everything from source. Is there any specific reason why the CI uses alpine linux? Pretty much no python project provides wheels for musllinux so far. |
Thank you! Looks cool. Please, resolve CI, and I'll merge it. No, there is no particular reason to use alpine except that it worked well before. You can migrate it to anything else, buster should do. I think it should be enough. Give it a try and if you can't make it work, just ping me, I'll help. |
@orsinium This will likely take a bit longer, since I am currently working on v2.0.0 of RapidFuzz, which will simplify this PR. I have one question regarding textdistance: |
If I remember correctly, sequence_based are ones that find common subsequences in both given strings. |
@orsinium I finally had time for this. I updated rapidfuzz, which simplifies the addition. In addition I updated the ci to debian. As a side effect this appears to decrease the CI runtime significantly. I do not understand the test failures. From reading it sound like the following occurs: rapidfuzz.distance.Jaro.similarity('0', '0') -> 0.0
rapidfuzz.distance.JaroWinkler.similarity('0', '0') -> 0.0
rapidfuzz.distance.Jaro.similarity([], []) -> 0.0
rapidfuzz.distance.JaroWinkler.similarity([], []) -> 0.0 I can not reproduce the first two issues: >>> from rapidfuzz.distance import Jaro, JaroWinkler
>>> Jaro.similarity('0', '0')
1.0
>>> JaroWinkler.similarity('0', '0')
1.0 the other one comes down to the question, whether the result of matching two empty sequences should be a perfect match or no match. >>> textdistance.algorithms.Jaro().similarity('', '')
1
>>> textdistance.algorithms.Jaro().similarity([], [])
1
>>> rapidfuzz.distance.Jaro.similarity('', '')
0.0
>>> rapidfuzz.distance.Jaro.similarity([], [])
0.0
>>> jellyfish.jaro_similarity('', '')
0.0
>>> Levenshtein.jaro_winkler('', '')
1.0 |
As far as I can see the behavior for two empty sequences does not matter, since it will not be called with two empty sequences anyways: if not all(sequences):
return self.maximum(*sequences)
# try get answer from external libs
answer = self.external_answer(*sequences)
if answer is not None:
return answer |
The point of the two failed tests is to ensure that doesn't matter if you use an external or internal implementation of an algorithm, the result will be the same in both cases. If there is a mismatch, we need to address it on either side.
The test |
If you have issues only with Jaro and Jaro-Winkler, you can disable rapidfuzz for them for now and merge only integration for Hamming and Levenshtein. You've done quite some work, and having something merged is better than nothing :) |
I worked around this issue by handling empty strings in textdistance.
appears this is fixed by fixing the behaviour for empty sequences, so apparently it actually called the lib with empty sequences. |
This comment was marked as resolved.
This comment was marked as resolved.
worked after re-triggering the CI. rebased everything on master and should be ready to merge. |
textdistance/libraries.py
Outdated
@@ -97,6 +97,18 @@ def get_function(self): | |||
# object constructor - the distance metric method is | |||
# called dist_abs() (whereas dist() gives a normalised distance) | |||
obj = getattr(module, self.func_name)().dist_abs | |||
elif self.module_name in {'rapidfuzz.distance.Jaro', 'rapidfuzz.distance.JaroWinkler'}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting 🤔 So, rapidfuzz considers two empty strings to have 0 similarity? Why? An empty string is equal to another empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it does it only for Jaro but not for Hamming, if I understand correctly. Very interesting. Can we fix it upstream? Or there is anything in the original algorithm that says otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does this for everything right now. However you only use the hamming distance, which is 0 in this case and perform the normalisation inside textdistance
.
Looking at wikipedia: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance it appears that the similarity is supposed to be 0 when there are 0 matching characters, which is the case for two empty strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I need to think about it a bit. One of the goals of textdistance is to be consistent in the behavior of all algorithms. And if two strings are equal, they have a zero distance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to textdistance
I have the following algorithms
distance
similarity
normalized_similarity
normalized_distance
for two empty strings I handle this in the following way:
distance -> 0
similarity -> max - distance = 0 - 0
normalized_similarity -> 1.0 - normalized_distance -> 1.0
normalized_distance -> max / dist if max else 0 -> 0
Now that should do. I've patched tests to skip quick results since there are calculated before calling any external libs anyway. UPD: yep, you mentioned it :) |
Thank you! |
The implementation used by rapidfuzz has the following algorithms
Additionally it supports any sequence of hashable types (e.g. lists of strings) and not only text
Here is the benchmark result:
and the benchmark results when adding slightly longer strings: