Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dictionary index case-sensitivity inconsistencies #121

Closed
wants to merge 1 commit into from

Conversation

welps
Copy link
Contributor

@welps welps commented Jun 12, 2020

Hi,

I pulled the gcide dictionary into plato and noticed that the case-sensitivity search was not working. This was because the gcide dictionary index does not case fold the headwords in the index. Looking at some other dictionaries, this seems to be inconsistently handled.

So this PR provides the following fixes (and tests):

  • Handling of dictionary index parsing from three possible states:

    • Full index parse
    • Lazy index parse (metadata only)
    • Resuming from lazy index parse
  • Casefolding (accounting for non-latin characters) for the dictionary-side query and when the index is being created within plato

Tested via emulator and on my Forma:

image

@baskerville
Copy link
Owner

I would rather not add workarounds for invalid dictionaries. I have already mentioned this problem. Where did you download the dictionary from?

@welps
Copy link
Contributor Author

welps commented Jun 13, 2020

I downloaded mine via arch's user repository, but I confirmed debian's official package repository compiles the index with the same inconsistent case.

I see that your post cites the same dictionary, the reason I used it is because it's seemingly the largest english dictionary available, there really don't seem to be too many options.

Since dict handles the "invalid" gcide properly, looking at the source code, note I'm not a C programmer so I may be wrong, it seems like their solution is to maintain a separate lowercase index which is similar to what I do.

If that's really the case, no pun intended, I would argue that it shouldn't be considered a workaround and just standard data munging.

@welps
Copy link
Contributor Author

welps commented Jun 22, 2020

@baskerville Is there anything I can do to address any concerns you may have?

If the primary dictd implementation is doing the same thing with case-insensitive dictionaries (see last comment if you missed it), why shouldn't we?

If there are concerns with the code itself, let me know and I can work to address your concerns. Thanks.

@baskerville
Copy link
Owner

I don't know why dictd is doing this: the dictionary you mentioned cannot be produced by dictfmt.

When dictfmt generates the index, it applies the necessary character transformations to the headwords.

You could prevent the headwords from being lowercased with --case-sensitive (in which case, the corresponding special entry will appear within the index), but why would you want that?

@welps
Copy link
Contributor Author

welps commented Jun 22, 2020

This is conjecture, but I believe gcide predates dictfmt and it was simply never updated to go through dictfmt. It seems dictfmt was introduced in 2002. gcide is at least 20 years old.

Case-sensitive handling was introduced in 2007 and the default handling was lowercasing the index which is probably why it never got addressed within gcide.

I understand your reluctance, but gcide seems to be the most popular dictionary alongside WordNet. If we think about how English users actually go about acquiring a dictionary for plato, which I think is already a bit difficult, they're going to end up seeing some variation of https://packages.debian.org/buster/dict-wn and https://packages.debian.org/buster/dict-gcide. Only one of those explicitly say it's a dictionary.

@baskerville
Copy link
Owner

Being aware of the dictionary situation, I did create my own version of WordNet 3.1 so that there would be at least one good english dictionary that works with Plato.

@welps
Copy link
Contributor Author

welps commented Jun 26, 2020

@baskerville

There would be another good english dictionary that works with Plato if this code was merged though. It works with the official dictd implementation as we've discussed so why shouldn't it here?

Why is the solution to produce and maintain another bespoke dictionary instead of leveraging what has existed for twenty+ years?

@baskerville
Copy link
Owner

Don't get me wrong: I'm acknowledging the weird backward compatibility problem.

I'm just looking the for most straightforward approach to solving this problem.

Fortunately, the dictionaries generated with dictfmt will have a 00-database-dictfmt-VERSION entry (unless --without-ver is passed!).

Have you found other dictd dictionaries, besides GCIDE, that aren't generated by dictfmt?

@welps
Copy link
Contributor Author

welps commented Jun 26, 2020

Yes, the dict-moby-thesaurus package does not conform either.

It maintains case for proper nouns, but does not declare 00-database-case-sensitive. See below for a snippet.

I did not realize you are also involved in https://github.com/freedict/libdict so I understand that your concerns about this may extend more outwards than I was aware of. I really believe the most straightforward solution is to case fold the query and the index headword.

Related to this, I did discover the rust-caseless default case folding function does not perform any normalization. The canonical caseless matching strategy recommended in Unicode Section 3.13 requires NFD normalization before/after case folding of a given word while noting NFD normalization after case folding is sufficient to handle most cases. However, Rust's unicode-normalization crate was recently shown to be 2-25x slower than ICU so maybe we don't want to deal with this at all for now.

$ head -100 /usr/share/dictd/moby-thesaurus.index

00-database-info        CV      Jj
00-database-short       BZ      8
00-database-url A       BZ
a cappella      L4      Ho
a la mode       Tg      Hs
a priori        bM      GC
ab ovo  mp      JC
Abaddon vr      G3
abandon 2i      BB2
abandoned       B4Y     1B
abase   CtZ     Fg
abasement       Cy5     HM
abash   C6F     Hh
abashed DBm     LS
abate   DM4     ih
abatement       DvZ     Vq
abatis  EFD     Ki
abbe    EPl     GA
abbess  EVl     Fm
abbot   EbL     FE
abbreviate      EgP     OF
abbreviated     EuU     Js
abbreviation    E4A     Sm
abdicate        FKm     I4
abdication      FTe     Fp
abdomen FZH     N1
abdominal       Fm8     Eg
abduction       Frc     Fw
abecedarian     FxM     Wq
abecedary       GH2     GB
aberrant        GN3     YP
aberration      GmG     ql
abet    HQr     NS
abettor Hd9     NI
abeyance        HrF     Mo
abhor   H3t     EC
abhorrent       H7v     LQ
abide   IPa     eS
abide by        IG/     Ib
abiding Its     Py
ability I9e     YO
abject  JVs     Vr
abjection       JrX     Fm
abjuration      Jw9     UW
abjure  KFT     WX
ablate  Kbq     RM
ablation        Ks2     Rv
ablaze  K+l     RB
able    LPm     Ob
ablution        LeB     GC
ably    LkD     HT
abnegation      LrW     TF
abnormal        L+b     ca
abnormality     Ma1     vy
aboard  NKn     Gc
abode   NRD     J0
abolish Na3     Jk
abolition       Nkb     L+
A-bomb  hO      Fb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants