proper names overrunning top word list #23

Open
matanninio opened this issue Mar 21, 2022 · 4 comments

Comments

@matanninio

For the 2022-03-20 word, "קרקס" (Hebrew for "circus"), what seemed to be the majority of the close words were proper names of people and fictional characters. Such words should generally not appear in the word list in the first place. Removing them may be awkward (though I suspect there is a reasonably easy way to filter them in the pipeline), so it is at least worth verifying, when selecting the daily word, that its list is not overrun by such words, since guessing them can be very frustrating.
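A minimal sketch of such a check, assuming the nearest neighbours of a candidate word are already available as a list and that some proper-name predicate exists. The names `proper_name_share`, `is_acceptable_secret`, and the `max_share` threshold are hypothetical and not part of the repository:

```python
from typing import Callable, Iterable


def proper_name_share(neighbors: Iterable[str],
                      is_proper_name: Callable[[str], bool]) -> float:
    """Return the fraction of neighbour words flagged as proper names."""
    words = list(neighbors)
    if not words:
        return 0.0
    return sum(1 for w in words if is_proper_name(w)) / len(words)


def is_acceptable_secret(neighbors: Iterable[str],
                         is_proper_name: Callable[[str], bool],
                         max_share: float = 0.2) -> bool:
    """Accept a candidate only if proper names do not dominate its list.

    max_share is an illustrative threshold, not a tuned value.
    """
    return proper_name_share(neighbors, is_proper_name) <= max_share


# Toy example: a hand-written set stands in for a real name detector.
known_names = {"alice", "bob"}
toy_neighbors = ["alice", "tent", "bob", "clown"]
share = proper_name_share(toy_neighbors, known_names.__contains__)
```

A candidate that fails the check could simply be skipped in favour of the next suggestion, which avoids having to clean the model itself.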

@Iddoyadlin
Collaborator

  1. Hmm, I'm not sure about removing proper names from the model, but it would be nice to have a parameter for that. I'm not really sure how easy that would be, though.
  2. The daily words are currently selected manually. There's a script in the repository called scripts/pick_random_word.py, but the random words it picks are usually not very interesting. It would be very cool to have a mechanism for automatically selecting "interesting" words given a gensim model. @matanninio maybe this should be the actual issue? Do you have any ideas here?

@ishefi
Owner

ishefi commented Mar 21, 2022

A small correction to @Iddoyadlin's answer: scripts/pick_random_word.py was removed, and its logic is now part of scripts/set_secret.py, which picks a random word if you don't provide a --secret. However, these are some of the words it suggests when it selects a random word from the db:

image

@matanninio
Author

HebPipe has named-entity recognition, which should be a rather good starting point.
You can check the English version, which computes the scores for names, but does not include them in the "top 1000" part.
Consider, for example, rot13("Dnffnz") for the 24th of March word. I got something like "16 Dnffnz 56.56 ????": by its score the word would have had a position of about 996.5/1000, but no position is shown for it. I would suggest a similar approach.
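One way to sketch that behaviour: every guess keeps its similarity score, but proper names never consume a slot in the ranked top list. This is an illustration under assumptions, not the English version's actual code; `rank_neighbors`, the `is_proper_name` predicate, and the convention that rank `top_n` is the closest word are all hypothetical:

```python
from typing import Callable, Dict, List, Optional, Tuple


def rank_neighbors(scored: List[Tuple[str, float]],
                   is_proper_name: Callable[[str], bool],
                   top_n: int = 1000) -> Dict[str, Tuple[float, Optional[int]]]:
    """Map each word to (similarity, rank), with rank None for excluded words.

    Ranks count down from top_n (closest) to 1. Proper names keep their
    similarity but get no rank and do not use up a position.
    """
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    result: Dict[str, Tuple[float, Optional[int]]] = {}
    next_rank = top_n
    for word, similarity in ordered:
        if is_proper_name(word) or next_rank < 1:
            result[word] = (similarity, None)
        else:
            result[word] = (similarity, next_rank)
            next_rank -= 1
    return result


# Toy data; str.istitle stands in for a real named-entity check,
# which for Hebrew would have to come from an NER tool instead.
toy_scores = [("clown", 0.8), ("Bozo", 0.9), ("tent", 0.7), ("rope", 0.6)]
ranks = rank_neighbors(toy_scores, str.istitle, top_n=2)
```

In the toy data, "Bozo" has the highest similarity but no rank, so the ranked list starts at "clown".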

@matanninio
Author

image

Are these among the top 5k words in Wikipedia? פטזאי, for instance, appears fewer than 24 times in Wikipedia, 19 of them in a single entry. Even if you do not filter for proper names or named entities, words this rare should probably not be considered as target words at all.
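A rough sketch of a rarity filter, assuming the corpus is available as a list of tokenized entries. It counts the number of entries a word appears in rather than raw occurrences, so a word concentrated in one entry does not pass. `document_frequency`, `eligible_targets`, and `min_docs` are hypothetical names:

```python
from collections import Counter
from typing import Iterable, List


def document_frequency(docs: Iterable[List[str]]) -> Counter:
    """Count, for each word, the number of entries that contain it."""
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each entry counts a word at most once
    return df


def eligible_targets(candidates: List[str],
                     docs: Iterable[List[str]],
                     min_docs: int) -> List[str]:
    """Keep only candidates that occur in at least min_docs entries."""
    df = document_frequency(docs)
    return [w for w in candidates if df[w] >= min_docs]


# Toy corpus: "b" is frequent overall but confined to a single entry.
toy_docs = [["a", "b", "b", "b"], ["a", "c"], ["a"]]
kept = eligible_targets(["a", "b", "c"], toy_docs, min_docs=2)
```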

Again, from the English version: "A: I grabbed a random list of the "most popular" 5,000 words in English, and removed anything capitalized or with hyphens, and the word2vec stopwords ("and", "if"). Then I shuffled it."
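The quoted procedure could be sketched roughly as below, assuming a frequency-ordered word list and a stopword set are supplied by the caller. `build_candidates` and its parameters are hypothetical. Note that Hebrew has no letter case, so the capitalization test would not remove Hebrew names and would need to be replaced by something like a named-entity check:

```python
import random
from typing import List, Optional, Set


def build_candidates(frequency_ranked: List[str],
                     stopwords: Set[str],
                     limit: int = 5000,
                     seed: Optional[int] = None) -> List[str]:
    """Take the most frequent words, drop unwanted ones, and shuffle."""
    top = frequency_ranked[:limit]
    kept = [
        w for w in top
        if w == w.lower()        # drop anything containing uppercase letters
        and "-" not in w         # drop hyphenated words
        and w not in stopwords   # drop stopwords
    ]
    rng = random.Random(seed)    # seeded only to make runs reproducible
    rng.shuffle(kept)
    return kept


# Toy input, most frequent first.
toy_words = ["the", "Paris", "well-known", "and", "circus", "tent"]
candidates = build_candidates(toy_words, {"the", "and"}, seed=0)
```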


3 participants