Add new DamerauLevenshtein... classes #84
Great work, thank you! How different are the implementations of the two algorithms? Can the same be achieved with the same implementation and
I don't know of a way of easily merging the two, but I haven't thought about it much. The unrestricted one is also much more complicated (double loop) than the restricted one (single loop), so I would not see any benefit in merging them either.
Oh, now I think I see what you mean: two different algorithms, but accessed through a single parameterised class. Yes, one could certainly do that.
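For illustration, here is a minimal sketch of what "two algorithms behind a single parameterised class" could look like. The class name `DamerauLevenshteinSketch` and the `restricted` flag are hypothetical and are not taken from the PR; the two methods follow the textbook formulations (optimal string alignment for the restricted variant, the Wikipedia-style algorithm with a per-character row table for the unrestricted one).

```python
class DamerauLevenshteinSketch:
    """Hypothetical wrapper: one class, two algorithms, selected by a flag."""

    def __init__(self, restricted=False):
        self.restricted = restricted

    def __call__(self, s1, s2):
        if self.restricted:
            return self._restricted(s1, s2)
        return self._unrestricted(s1, s2)

    @staticmethod
    def _restricted(s1, s2):
        # Optimal string alignment: a transposed pair may not be edited again.
        len1, len2 = len(s1), len(s2)
        d = {(i, 0): i for i in range(len1 + 1)}
        d.update({(0, j): j for j in range(len2 + 1)})
        for i, cs1 in enumerate(s1, start=1):
            for j, cs2 in enumerate(s2, start=1):
                cost = 0 if cs1 == cs2 else 1
                d[i, j] = min(
                    d[i - 1, j] + 1,         # deletion
                    d[i, j - 1] + 1,         # insertion
                    d[i - 1, j - 1] + cost,  # match or substitution
                )
                # Transposition of two adjacent characters.
                if i > 1 and j > 1 and cs1 == s2[j - 2] and s1[i - 2] == cs2:
                    d[i, j] = min(d[i, j], d[i - 2, j - 2] + 1)
        return d[len1, len2]

    @staticmethod
    def _unrestricted(s1, s2):
        # Unrestricted variant: characters may be edited between transposed ones.
        len1, len2 = len(s1), len(s2)
        maxdist = len1 + len2
        da = {}  # last row (1-based) in which each character of s1 was seen
        d = {(-1, -1): maxdist}
        for i in range(len1 + 1):
            d[i, -1] = maxdist
            d[i, 0] = i
        for j in range(len2 + 1):
            d[-1, j] = maxdist
            d[0, j] = j
        for i, cs1 in enumerate(s1, start=1):
            db = 0  # last column in this row where the characters matched
            for j, cs2 in enumerate(s2, start=1):
                k = da.get(cs2, 0)
                m = db
                if cs1 == cs2:
                    cost = 0
                    db = j
                else:
                    cost = 1
                d[i, j] = min(
                    d[i - 1, j - 1] + cost,  # match or substitution
                    d[i, j - 1] + 1,         # insertion
                    d[i - 1, j] + 1,         # deletion
                    d[k - 1, m - 1] + (i - k - 1) + 1 + (j - m - 1),  # transposition
                )
            da[cs1] = i
        return d[len1, len2]


print(DamerauLevenshteinSketch(restricted=True)("CA", "ABC"))  # 3
print(DamerauLevenshteinSketch(restricted=False)("CA", "ABC"))  # 2
```

The pair "CA" / "ABC" is the usual example where the two variants disagree: the unrestricted variant can transpose "CA" to "AC" and then insert "B" between the swapped characters, which the restricted variant does not allow.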
I can take it from here if you want. I haven't touched textdistance for a while, but that shouldn't be hard.
Yes, that would be great, thanks! A thing that's coming up soon, though it may be somewhat harder to handle, is Python 3.10 and the dependency on abydos. It turns out that the current release of abydos is incompatible with Python 3.10, and though a patch was applied to the release branch a couple of years ago, the developer has not touched the package in about 9 months and has not released a patched version. And using the patched abydos, textdistance.benchmark fails (though the test suite still passes). Just letting you know...
Yep, I also noticed it. Well, abydos is pretty low in benchmarks, we can drop it.
)
da[cs1] = i

return d[len1, len2]
Is this really correct? We write into d[i, j], which should be at most d[len1-1, len2-1], from my understanding.
I don't remember much about the algorithms, I can only rely on tests.
I checked the code. What a mess it is, oh my. We first set these values by iterating over range(len(s1) + 1) when initializing the matrix. And then, when we do the enumeration for i, cs1 in enumerate(s1):, we shift the index on the next line: i += 1
🧠 I'll try to clean it up a tiny bit.
This is deliberate! The boundary values are needed, but are never modified later, as the algorithm looks at the preceding values in the array when calculating the new values. I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up, so I decided to be somewhat non-Pythonic and stick to the algorithm as presented on Wikipedia!
> I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up
Tip of the day: enumerate accepts an argument start, where you can specify the initial value. I've changed the code to use it. It became a bit slower, for some reason, but IMHO a bit more readable.
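To make the indexing discussed in this thread concrete, here is a small sketch using plain Levenshtein distance as a simpler stand-in (it is not the PR's code). It assumes the same layout as described above: the matrix is initialized over range(len(s) + 1), so row 0 and column 0 hold boundary values, and the loops use enumerate with start=1 instead of a manual i += 1. That is why reading d[len1, len2] at the end is valid.

```python
def levenshtein(s1, s2):
    len1, len2 = len(s1), len(s2)
    d = {}
    # Boundary values: row 0 and column 0 are set once and never overwritten.
    for i in range(len1 + 1):
        d[i, 0] = i
    for j in range(len2 + 1):
        d[0, j] = j
    # start=1 makes i and j run from 1 to len1 / len2, so they index the
    # matrix directly and no "i += 1" shift is needed inside the loop.
    for i, cs1 in enumerate(s1, start=1):
        for j, cs2 in enumerate(s2, start=1):
            cost = 0 if cs1 == cs2 else 1
            d[i, j] = min(
                d[i - 1, j] + 1,         # deletion
                d[i, j - 1] + 1,         # insertion
                d[i - 1, j - 1] + cost,  # match or substitution
            )
    # The last cell written is d[len1, len2], not d[len1 - 1, len2 - 1].
    return d[len1, len2]


print(list(enumerate("ab", start=1)))  # [(1, 'a'), (2, 'b')]
print(levenshtein("kitten", "sitting"))  # 3
```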
Yes, I completely missed the += 1 before.
> > I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up
> Tip of the day: enumerate accepts an argument start, where you can specify the initial value. I've changed the code to use it. It became a bit slower, for some reason, but IMHO a bit more readable.
Oh, how neat! Thanks for that tip. I agree, more readable is better.
Thank you all :)
There are two versions of the Damerau-Levenshtein distance, as described in this Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1018933 Some of the external libraries implement one of them, others the other.
This PR introduces two different classes: DamerauLevenshteinRestricted and DamerauLevenshteinUnrestricted, with DamerauLevenshtein being the unrestricted version, so that it is clear what is intended.