Remove generic tokenizer and support multiple languages for the word cloud. #388
Cool cool cool! 😚👌 I only have some small comments.
@JosephMarinier @lindsaydbrin actually, I realized that I could create a specific config scope for TopWords. See my last commit. Does that make sense to you?
If it works, seems like a more direct approach! It seems like if we do this a lot, we could have a tangled mess of custom config pairings, but we can deal with that if it happens. But anyway, generally, I'd defer to @JosephMarinier on this.
Looks great - thanks for taking care of this, and for such helpful Description context! Some small edits on comments, one small question mostly for my understanding, and I'll let Joseph comment on the config scope change. It was already approved, but here's another anyhow. 😆
Co-authored-by: Lindsay Brin <[email protected]>
Description:
I investigated whether the generic tokenizer was still useful. It turns out it wasn't really, so I made the following changes:
- Removed the generic tokenizer and now filter out punctuation (`is_punct`) and stop words (`is_stop`). The characters are a bit different from the ones we had, but as you can see in the tests, the differences seem ok to me. Plus, now it will work for French :).
- Changed the config scope of `TopWordsModule` from `ModelContractConfig` to `AzimuthConfig`, since it now relies on either the model or the syntax config, depending on whether saliency is available. `AzimuthConfig` is a bit too broad, but since this module computes fast, I think that it is ok.

Checklist:
You should check all boxes before the PR is ready. If a box does not apply, check it to acknowledge it.
- ran `pre-commit run --all-files` at the end.
- our users.
- `README` files and our wiki for any big design decisions, if relevant.
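
For readers skimming the description, here is a minimal sketch of the filtering it mentions: counting words for a word cloud while skipping tokens flagged as punctuation (`is_punct`) or stop words (`is_stop`). The `Token` class and the `top_words` function are hypothetical stand-ins written for this illustration, not code from this PR; in practice the flags would come from whichever language-aware tokenizer the project uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a tokenizer's token object."""

    text: str
    is_punct: bool = False  # True for punctuation tokens
    is_stop: bool = False  # True for stop words of the configured language


def top_words(tokens: Iterable[Token], top_n: int = 3) -> List[Tuple[str, int]]:
    """Count lowercased tokens, skipping punctuation and stop words."""
    counts = Counter(
        tok.text.lower()
        for tok in tokens
        if not tok.is_punct and not tok.is_stop
    )
    return counts.most_common(top_n)


# French example: the flags are set by hand here; a real tokenizer
# would provide them based on the language it was loaded for.
tokens = [
    Token("Le", is_stop=True),
    Token("chat"),
    Token("et", is_stop=True),
    Token("le", is_stop=True),
    Token("chien"),
    Token(",", is_punct=True),
    Token("le", is_stop=True),
    Token("chat"),
    Token("!", is_punct=True),
]
print(top_words(tokens))  # [('chat', 2), ('chien', 1)]
```

Because the filtering relies only on per-token flags, the same counting code works for any language the tokenizer supports, which matches the multi-language goal in the PR title.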