Add new DamerauLevenshtein... classes #84

juliangilbey · 2022-09-04T15:33:32Z

There are two versions of the Damerau-Levenshtein distance, as described in this Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1018933 Some of the external libraries implement one of them, others the other.

This PR introduces two different classes, DamerauLevenshteinRestricted and DamerauLevenshteinUnrestricted, with DamerauLevenshtein being the unrestricted version, so that it is clear what is intended.
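For readers unfamiliar with the distinction, here is a minimal sketch of both variants, following the textbook formulations rather than the code in this PR. The function names `osa_distance` and `damerau_levenshtein_distance` are made up for illustration and are not part of textdistance. The restricted variant (optimal string alignment) never edits a substring twice, so the two can disagree.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted variant (optimal string alignment). Illustrative only."""
    len1, len2 = len(s1), len(s2)
    # d[i][j] is the distance between s1[:i] and s2[:j].
    d = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(len1 + 1):
        d[i][0] = i
    for j in range(len2 + 1):
        d[0][j] = j
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                # Swap of two adjacent characters.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len1][len2]


def damerau_levenshtein_distance(s1: str, s2: str) -> int:
    """Unrestricted variant. Illustrative only."""
    len1, len2 = len(s1), len(s2)
    max_dist = len1 + len2
    da = {}  # last row (1-based) in which each character of s1 was seen
    # One extra sentinel row and column filled with max_dist; every
    # "real" index is therefore shifted by one.
    d = [[max_dist] * (len2 + 2) for _ in range(len1 + 2)]
    for i in range(len1 + 1):
        d[i + 1][1] = i
    for j in range(len2 + 1):
        d[1][j + 1] = j
    for i in range(1, len1 + 1):
        db = 0  # last column in this row where the characters matched
        for j in range(1, len2 + 1):
            k = da.get(s2[j - 1], 0)
            l = db
            if s1[i - 1] == s2[j - 1]:
                cost = 0
                db = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # substitution or match
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                # Transposition with edits in between.
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),
            )
        da[s1[i - 1]] = i
    return d[len1 + 1][len2 + 1]
```

With these sketches, "CA" and "ABC" are 3 apart under the restricted definition but 2 apart under the unrestricted one (CA → AC → ABC), while a plain adjacent swap such as "ab" and "ba" costs 1 under both.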

orsinium · 2022-09-06T10:09:55Z

Great work, thank you! How different are the implementations of the two algorithms? Can the same be achieved with a single implementation and a restricted flag, similar to how JaroWinkler has a winklerize flag?

juliangilbey · 2022-09-06T10:37:52Z

I don't know of a way of easily merging the two, but I haven't thought about it much. The unrestricted one is also much more complicated (double loop) than the restricted one (single loop), so I would not see any benefit in merging them either.

juliangilbey · 2022-09-06T10:39:42Z

Oh, now I think I see what you mean; two different algorithms but accessed by a single parameterised class. Yes, one could certainly do that.
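One way to picture the "single parameterised class" idea is sketched below. Everything here is hypothetical: the class name `FlaggedDistance`, the injected callables, the stub functions, and the default value of the flag are assumptions made for illustration, not the textdistance API.

```python
from typing import Callable

DistanceFunc = Callable[[str, str], int]


class FlaggedDistance:
    """Hypothetical wrapper: one public class, two interchangeable algorithms."""

    def __init__(
        self,
        restricted_func: DistanceFunc,
        unrestricted_func: DistanceFunc,
        restricted: bool = True,  # default chosen arbitrarily for this sketch
    ) -> None:
        self._restricted_func = restricted_func
        self._unrestricted_func = unrestricted_func
        self.restricted = restricted

    def __call__(self, s1: str, s2: str) -> int:
        # Identical inputs need no work under either definition.
        if s1 == s2:
            return 0
        if self.restricted:
            return self._restricted_func(s1, s2)
        return self._unrestricted_func(s1, s2)


# Stand-in algorithms for the demo only; real implementations would go here.
def stub_restricted(s1: str, s2: str) -> int:
    return 3


def stub_unrestricted(s1: str, s2: str) -> int:
    return 2
```

The two algorithms stay separate functions, so nothing needs to be merged; the flag only selects which one a call is routed to.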

orsinium · 2022-09-06T10:57:54Z

I can take it from here if you want. I haven't touched textdistance for a while, but that shouldn't be hard.

juliangilbey · 2022-09-08T08:38:39Z

Yes, that would be great, thanks!

A thing that's coming up soon, though it may be somewhat harder to handle, is Python 3.10 and the dependency on abydos. It turns out that the current release of abydos is incompatible with Python 3.10, and though a patch was applied to the release branch a couple of years ago, the developer has not touched the package in about 9 months and has not released a patched version. And using the patched abydos, textdistance.benchmark fails (though the test suite still passes).

Just letting you know...

orsinium · 2022-09-08T09:07:27Z

Yep, I also noticed it. Well, abydos is pretty low in benchmarks, we can drop it.

textdistance/libraries.py

maxbachmann · 2022-09-18T00:20:50Z

textdistance/algorithms/edit_based.py

+                )
+            da[cs1] = i
+
+        return d[len1, len2]


Is this really correct? We write into d[i, j], which should be at most d[len1-1, len2-1] from my understanding.

I don't remember much about the algorithms; I can only rely on tests.

I checked the code. What a mess it is, oh my. We first set these values by iterating over range(len(s1) + 1) when initializing the matrix. Then, when we enumerate with for i, cs1 in enumerate(s1):, we shift the index on the very next line: i += 1 🧠 I'll try to clean it up a tiny bit.

This is deliberate! The boundary values are needed, but are never modified later, as the algorithm looks at the preceding values in the array when calculating the new values. I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up, so I decided to be somewhat non-Pythonic and stick to the algorithm as presented on Wikipedia!
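To illustrate why returning d[len1, len2] is in range, here is a small sketch using plain Levenshtein distance rather than the PR's code. The function name `levenshtein_with_boundary` and the dict-of-tuples matrix are assumptions chosen to mirror the `d[i, j]` indexing in the diff.

```python
def levenshtein_with_boundary(s1: str, s2: str) -> int:
    """Plain Levenshtein distance, written to expose the boundary cells."""
    len1, len2 = len(s1), len(s2)
    d = {}
    # Boundary values: row 0 and column 0, hence len + 1 entries each.
    for i in range(len1 + 1):
        d[i, 0] = i
    for j in range(len2 + 1):
        d[0, j] = j
    boundary = dict(d)  # snapshot, used only for the check below

    # The main loops run over 1..len, so the last cell written is
    # d[len1, len2], not d[len1 - 1, len2 - 1].
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            cost = int(s1[i - 1] != s2[j - 1])
            d[i, j] = min(
                d[i - 1, j] + 1,         # deletion
                d[i, j - 1] + 1,         # insertion
                d[i - 1, j - 1] + cost,  # substitution or match
            )

    # The boundary is read by the recurrence but never overwritten.
    assert all(d[key] == value for key, value in boundary.items())
    return d[len1, len2]
```

Each cell only looks at cells with smaller indices, which is why the extra row and column must exist before the loops start.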

> I tried to come up with a way of avoiding the i+=1 business, but I kept messing things up

Tip of the day: enumerate accepts an argument start, where you can specify the initial value. I've changed the code to use it. It became a bit slower, for some reason, but IMHO a bit more readable.
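A tiny, self-contained illustration of the tip, not taken from the PR diff: both loops below produce the same 1-based indices, assuming a sample string `s1`.

```python
s1 = "abc"

# Manual shift: enumerate from 0, then bump the index on the next line.
shifted = []
for i, cs1 in enumerate(s1):
    i += 1
    shifted.append((i, cs1))

# Same pairs, with the offset passed to enumerate directly.
started = [(i, cs1) for i, cs1 in enumerate(s1, start=1)]
```

Both lists hold the pairs (1, "a"), (2, "b"), (3, "c").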

Yes, I completely missed the += 1 before.

Oh, how neat! Thanks for that tip. I agree, more readable is better.

orsinium · 2022-09-18T07:43:30Z

Thank you all :)

juliangilbey force-pushed the split-damerau-levenshtein branch from 74c7b0e to 90a9679 Compare September 4, 2022 15:37

Add new DamerauLevenshteinRestricted and DamerauLevenshteinUnrestricted classes

742edf5

juliangilbey force-pushed the split-damerau-levenshtein branch from 90a9679 to 742edf5 Compare September 5, 2022 12:08

orsinium self-assigned this Sep 8, 2022

orsinium added 3 commits September 9, 2022 14:30

Merge branch 'master' into split-damerau-levenshtein

de1bb8e

DamerauLevenshtein: make restricted a flag

789893b

cleanup docstring

983e9c0

maxbachmann reviewed Sep 16, 2022

textdistance/libraries.py (resolved)

maxbachmann reviewed Sep 18, 2022

orsinium added 2 commits September 18, 2022 09:14

add OSA from rapidfuzz for DL

47703b2

clean-up DL a bit

efd915c

orsinium merged commit 19b7238 into life4:master Sep 18, 2022

juliangilbey deleted the split-damerau-levenshtein branch December 19, 2023 10:19
