proper names overrunning top word list #23

matanninio · 2022-03-21T10:44:43Z

For the word of 2022-03-20, "קרקס" (Hebrew for "circus"), what seems to be the majority of the close words were proper names of people and fictional characters. Such words should generally not appear in the word list in the first place. Removing them may be an annoying task (though I suspect there is an easy way to filter them reasonably well within the pipeline), so at the very least it is worth verifying, when selecting the daily words, that the list is not overrun by such words, as guessing them can be very frustrating.
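A minimal sketch of the kind of check I have in mind, assuming the neighbours of a candidate word are already available as a list and that some predicate can flag proper names. Both `is_proper_name` and the 20% threshold are hypothetical placeholders, not anything that exists in the repository:

```python
from typing import Callable, Iterable


def proper_name_ratio(
    neighbors: Iterable[str],
    is_proper_name: Callable[[str], bool],
) -> float:
    """Return the fraction of neighbour words flagged as proper names."""
    words = list(neighbors)
    if not words:
        return 0.0
    flagged = sum(1 for word in words if is_proper_name(word))
    return flagged / len(words)


def is_acceptable_secret(
    neighbors: Iterable[str],
    is_proper_name: Callable[[str], bool],
    max_ratio: float = 0.2,  # assumed threshold, to be tuned by hand
) -> bool:
    """Reject a candidate whose close-word list is dominated by names."""
    return proper_name_ratio(neighbors, is_proper_name) <= max_ratio
```

The predicate could be backed by a name list or by a named-entity tagger; the check itself does not care which.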

Iddoyadlin · 2022-03-21T12:09:44Z

Hmm, I'm not sure about removing proper names from the model, but it would be nice to have a parameter for that. I'm not sure how easy that would be, though.
The daily words are currently selected manually. There's a script in the repository called scripts/pick_random_word.py, but the random words are usually not very interesting... It would be very cool to have a mechanism for automatically selecting "interesting" words given a gensim model. @matanninio maybe this should be the actual issue? Do you have any ideas here?

ishefi · 2022-03-21T12:53:58Z

A small correction to @Iddoyadlin's answer: scripts/pick_random_word.py was removed, and its logic is now part of scripts/set_secret.py if you don't provide a --secret. However, these are some of the words it suggests when it is set to select a random word from the db:

matanninio · 2022-03-24T09:09:40Z

HebPipe has named-entity recognition, which should be a rather good starting point.
You can check the English version, which computes scores for names but does not include them in the "top 1000" ranking.
Consider, for example, rot13("Dnffnz") for the 24th of March word. I got something like "16 Dnffnz 56.56 ????": by its score the word would have had a position of about 996.5/1000, but it is not given one. I would suggest a similar approach.
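Roughly, the approach could look like the sketch below: every word keeps its similarity score, but names are skipped when positions are handed out. This is only my guess at the behaviour, not the English version's actual code. The `scores` mapping, the `is_proper_name` predicate, and the convention that position `top_n` goes to the closest word are all assumptions for illustration:

```python
from typing import Callable, Dict, Optional, Tuple


def rank_excluding_names(
    scores: Dict[str, float],
    is_proper_name: Callable[[str], bool],
    top_n: int = 1000,
) -> Dict[str, Tuple[float, Optional[int]]]:
    """Map each word to (score, position); names get a score but no position."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    result: Dict[str, Tuple[float, Optional[int]]] = {}
    position = top_n  # assumed convention: the closest word gets top_n
    for word, score in ordered:
        if is_proper_name(word) or position <= 0:
            # Still scored, so a guess gets feedback, but left unranked.
            result[word] = (score, None)
        else:
            result[word] = (score, position)
            position -= 1
    return result
```

With this, a guessed name still shows how close it is, but it never takes up one of the ranked slots.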

matanninio · 2022-03-24T09:20:01Z

Are these in the top 5k words in Wikipedia? פטזאי, for instance, appears fewer than 24 times in Wikipedia, 19 of them in one entry. Even if you do not filter for proper names/named entities, you should probably not even consider such rare words as target words.

Again, from the English version: "A: I grabbed a random list of the "most popular" 5,000 words in English, and removed anything capitalized or with hyphens, and the word2vec stopwords ("and", "if"). Then I shuffled it."
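A rough sketch of that recipe adapted to our case, assuming word counts from Wikipedia are available as a plain dict. Hebrew has no capitalization to filter on, so names and stopwords would have to come from an explicit exclusion set; `word_counts`, `excluded`, and the function name are hypothetical:

```python
import random
from typing import Dict, List, Optional, Set


def build_secret_candidates(
    word_counts: Dict[str, int],
    excluded: Set[str],
    pool_size: int = 5000,
    seed: Optional[int] = None,
) -> List[str]:
    """Take the most frequent words, drop excluded and hyphenated ones, shuffle."""
    most_common = sorted(word_counts, key=word_counts.get, reverse=True)[:pool_size]
    candidates = [
        word
        for word in most_common
        if word not in excluded and "-" not in word
    ]
    # A seeded Random instance keeps the shuffle reproducible when needed.
    random.Random(seed).shuffle(candidates)
    return candidates
```

Cutting to the most frequent words first means a word as rare as the example above never reaches the pool, regardless of whether it is tagged as a name.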
