Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove generic tokenizer and support multiple languages for the word cloud. #388

Merged
merged 7 commits into from
Jan 26, 2023

Conversation

gabegma
Copy link
Contributor

@gabegma gabegma commented Jan 24, 2023

Description:

I investigated if the generic tokenizer was still useful. It turns out it wasn't really, so I made the following changes:

  • There are no more dummy saliencies (equal to 0) that are computed when saliency is not available. It was counter-productive to compute them since they were not used in the UI, and for the top words, the words were converted to sub-tokens, and then back to words.
  • Because of that, I added saliency as an optional startup task.
  • Furthermore, I needed to refactor the top words code, so it no longer calls the dummy saliency method when saliency is not available. This turned out to be an opportunity: I now use the Spacy model (which, as we know, can be defined per language), which already has its own list of punctuation chars (is_punct) and stop words (is_stop). The chars are a bit different from the ones we had, but as you can see in the tests, the differences seem ok to me. Plus, now it will work for french :).
    • However, I needed to add a char replacement for this to work. The English Spacy model did not recognize "`" as a punctuation sign. I believe this also caused problems that we weren't seeing when counting tokens or performing POS tagging. When replacing it by "'", this slightly changed the values in the dataset warnings (token count).
      • Example from sst2 with the "`":

      designed to provide a mix of smiles and tears , `` crossroads '' instead provokes a handful of unintentional howlers and numerous yawns .

    • I needed to change the scope of TopWordsModule from ModelContractConfig to AzimuthConfig, since it now relies on both the model or the syntax config, depending if saliency is available. AzimuthConfig is a bit too broad, but since this module computes fast, I think that it is ok.

Checklist:

You should check all boxes before the PR is ready. If a box does not apply, check it to acknowledge it.

  • ISSUE NUMBER. You linked the issue number (Ex: Resolve #XXX).
  • PRE-COMMIT. You ran pre-commit on all commits, or else, you
    ran pre-commit run --all-files at the end.
  • USER CHANGES. The changes are added to CHANGELOG.md and the documentation, if they impact
    our users.
  • DEV CHANGES.
    • Update the documentation if this PR changes how to develop/launch on the app.
    • Update the README files and our wiki for any big design decisions, if relevant.
    • Add unit tests, docstrings, typing and comments for complex sections.

@gabegma gabegma self-assigned this Jan 24, 2023
@gabegma gabegma changed the title Remove generic tokenizer Remove generic tokenizer and support multiple languages for the word cloud. Jan 24, 2023
Copy link
Contributor

@JosephMarinier JosephMarinier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool cool cool! 😚👌 I only have some small comments.

@gabegma
Copy link
Contributor Author

gabegma commented Jan 25, 2023

@JosephMarinier @lindsaydbrin actually, I realized that I could create a specific config scope for TopWords. See my last commit. Does that make sense to you?

@lindsaydbrin
Copy link
Contributor

@JosephMarinier @lindsaydbrin actually, I realized that I could create a specific config scope for TopWords. See my last commit. Does that make sense to you?

If it works, seems like a more direct approach! It seems like if we do this a lot, we could have a tangled mess of custom config pairings, but we can deal with that if it happens. But anyway, generally, I'd defer to @JosephMarinier on this.

Copy link
Contributor

@lindsaydbrin lindsaydbrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - thanks for taking care of this, and for such helpful Description context! Some small edits on comments, one small question mostly for my understanding, and I'll let Joseph comment on the config scope change. It was already approved, but here's another anyhow. 😆

@gabegma gabegma merged commit 1e0c33e into main Jan 26, 2023
@gabegma gabegma deleted the ggm/remove-tokenizer branch January 26, 2023 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants