proper names overrunning top word list #23
Comments
A small correction to @Iddoyadlin's answer:
HebPipe has named-entity recognition, which should be a rather good starting point.
Are these in the top 5k words in Wikipedia? פטזאי, for instance, appears fewer than 24 times in Wikipedia, 19 of them in a single entry. Even if you do not filter out proper names or named entities, you should probably not consider words that rare as target words at all. Again, from the English version: "A: I grabbed a random list of the "most popular" 5,000 words in English, and removed anything capitalized or with hyphens, and the word2vec stopwords ("and", "if"). Then I shuffled it."
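To make the suggestion concrete, here is a rough sketch of that kind of filter. It assumes you already have a tokenized corpus and a set of proper names (for example, collected from a named-entity tagger); the function name, the `top_n` default and the `min_count` threshold are placeholders of mine, not anything taken from this repo.

```python
from collections import Counter


def select_candidates(tokens, proper_names, top_n=5000, min_count=25):
    """Return frequent, non-proper-name words to use as target candidates.

    tokens:       iterable of corpus tokens (assumed already tokenized)
    proper_names: set of words flagged as named entities (assumed input)
    top_n:        how many of the most frequent words to consider
    min_count:    hypothetical cutoff below which a word is too rare
    """
    counts = Counter(tokens)
    candidates = []
    # most_common() yields words in descending frequency order,
    # so we can stop as soon as we fall below the cutoff.
    for word, count in counts.most_common(top_n):
        if count < min_count:
            break
        if word in proper_names:
            continue
        candidates.append(word)
    return candidates
```

A word like פטזאי, with fewer than 24 occurrences, would be dropped by the frequency cutoff alone, even before the proper-name check.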
For the word of 20220320, "קרקס", the majority of the closest words seem to have been proper names of people and fictional characters. Such words should generally not appear in the word list in the first place. Removing them may be tedious (though I suspect the pipeline offers a reasonably easy way to filter them), so it is at least worth verifying, when selecting the daily words, that the list is not overrun by such words, since guessing them can be very frustrating.
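A minimal sketch of the check I have in mind when picking a daily word, assuming the nearest words are available as a list ordered from closest to farthest and that a set of known proper names exists. The names `proper_name_share` and `is_acceptable_daily_word`, and the `top_k` and `max_share` values, are made up for illustration.

```python
def proper_name_share(neighbors, proper_names, top_k=100):
    """Fraction of the top_k closest words that are proper names."""
    top = neighbors[:top_k]
    if not top:
        return 0.0
    return sum(1 for word in top if word in proper_names) / len(top)


def is_acceptable_daily_word(neighbors, proper_names, top_k=100, max_share=0.2):
    """Reject a daily word whose closest words are mostly proper names.

    max_share is a hypothetical threshold; tune it against real lists.
    """
    return proper_name_share(neighbors, proper_names, top_k) <= max_share
```

A word that fails the check could simply be skipped in favor of the next candidate in the shuffled list.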