Apostrophes in words make them distinct when they aren't #11

Open
alonscheuer opened this issue Mar 10, 2022 · 2 comments
Comments

@alonscheuer

A recognizable word with an apostrophe (or several apostrophes) added anywhere in it is treated as a distinct word, yet it gets the same closeness value as the word without the apostrophes.

For example, all of the following words were accepted as distinct words, and they all had the exact same closeness value:
צבע
צבע'
'צבע
צ'בע
צב'ע
צב'ע'
צ''''בע

A more correct behavior would probably be to either reject those words or not count them as distinct from the original.
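As a rough illustration of the second option (not counting the variants as distinct), here is a minimal sketch. `canonical` and `GuessLog` are hypothetical names invented for this example and are not taken from the project; the sketch assumes that removing the ASCII apostrophe is enough to identify equivalent guesses.

```python
def canonical(word: str) -> str:
    """Return the word with ASCII apostrophes removed (assumed canonical form)."""
    return word.replace("'", "")


class GuessLog:
    """Hypothetical guess tracker that counts apostrophe variants only once."""

    def __init__(self) -> None:
        # Maps canonical form -> first spelling the player guessed.
        self._seen: dict[str, str] = {}

    def add(self, guess: str) -> bool:
        """Record a guess; return False if it is empty or already covered."""
        key = canonical(guess)
        if not key or key in self._seen:
            return False
        self._seen[key] = guess
        return True

    def __len__(self) -> int:
        return len(self._seen)


log = GuessLog()
variants = ["צבע", "צבע'", "'צבע", "צ'בע", "צב'ע", "צב'ע'", "צ''''בע"]
# Only the first spelling is accepted; the rest collapse onto the same key.
results = [log.add(v) for v in variants]
```

With this approach only the first of the seven spellings above is counted as a guess. Rejecting the variants outright would instead mean returning an error whenever `canonical(guess) != guess`.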

@ishefi
Owner

ishefi commented Mar 10, 2022

Thanks!
It seems that gensim.corpora.wikicorpus, which we use to sanitize the w2v input, strips apostrophes by default. This may be avoidable, but it requires some investigation. To allow words like ז'בוטינסקי, we delete the apostrophes on the server side.

A possible solution to this bug (suggested by @Iddoyadlin) is to delete the apostrophe on the client side, at least until we figure out how to sanitize the data correctly.
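The normalization this workaround relies on could look roughly like the sketch below. It is written in Python only to show the logic; the real client is presumably JavaScript and would need an equivalent. The function name `strip_apostrophes` and the set of characters treated as apostrophes are assumptions made for this example, not something confirmed by the project.

```python
# Assumption: these characters should all count as "an apostrophe":
# ASCII apostrophe, right single quotation mark (U+2019), Hebrew geresh (U+05F3).
APOSTROPHE_LIKE = "'\u2019\u05f3"

# Translation table that deletes every apostrophe-like character.
_STRIP_TABLE = str.maketrans("", "", APOSTROPHE_LIKE)


def strip_apostrophes(guess: str) -> str:
    """Trim surrounding whitespace and delete apostrophe-like characters."""
    return guess.strip().translate(_STRIP_TABLE)


# Applied before the guess is submitted, so every variant is sent as one word.
cleaned = strip_apostrophes("ז'בוטינסקי")
```

Stripping the guess before it is sent means the client's list of guessed words and the server's lookup both see the same apostrophe-free form.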

@Itamarb01

I'm not sure if this is related, but ג'ירף and ג'ירפה are not recognized.
