
Language Model Redesign #268

Merged
merged 282 commits into from
Mar 30, 2023

Conversation

@dcgaines (Collaborator) commented Mar 13, 2023

Overview

This PR reworks the existing GPT-2 language model implementation to fix several critical bugs. The new CausalLanguageModel class fixes these bugs and supports any causal model from HuggingFace (or a locally trained one). This PR also adds the KenLMLanguageModel class, which implements an n-gram model, and the MixtureLanguageModel class, which combines the outputs of two or more other models.

Ticket

https://www.pivotaltracker.com/story/show/183975978
https://www.pivotaltracker.com/story/show/184017969
https://www.pivotaltracker.com/story/show/184589162
https://www.pivotaltracker.com/story/show/184365440

Contributions

  • lm_eval.py script for the evaluation of perplexity and prediction time for any given language model
  • mixture_tuning.py script for the optimization of mixture weights for use in the MixtureLanguageModel
  • CausalLanguageModel class, a generic language model class that can be used with many different causal models from HuggingFace
  • KenLMLanguageModel class, an n-gram language model class
  • MixtureLanguageModel class, which allows for the mixture of the outputs of two or more other models
  • Creation of lm_params.json to accommodate the usage of the new language model classes. This file is copied into the session data directory alongside parameters.json
  • Deprecated old GPT-2 implementation due to critical bugs. It is replaced by the CausalLanguageModel class, which can use gpt2 from HuggingFace
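Since CausalLanguageModel, KenLMLanguageModel, and MixtureLanguageModel are meant to be interchangeable, they presumably share a common interface. The sketch below is a hypothetical version of that pattern; the class and method names are assumptions, not BciPy's actual API, and the toy uniform model stands in for a real predictor:

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class LanguageModel(ABC):
    """Hypothetical base class; the real BciPy interface may differ."""

    @abstractmethod
    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        """Return (symbol, probability) pairs for the next character."""


class UniformLanguageModel(LanguageModel):
    """Toy stand-in model: every symbol is equally likely."""

    def __init__(self, symbols: List[str]):
        self.symbols = symbols

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        p = 1.0 / len(self.symbols)
        return [(s, p) for s in self.symbols]


lm = UniformLanguageModel(["a", "b", "c", "_"])
preds = lm.predict(list("he"))  # each of the four symbols gets 0.25
```

A shared abstract base class is what lets evaluation scripts like lm_eval.py run unchanged against any of the model classes.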

Test

  • Wrote test suites for each language model class based on the pre-existing tests for the previous GPT2LanguageModel class. Expanded tests where necessary to ensure proper performance.
  • Tested locally using a fake LSL server
  • Tested on new home use machine by OHSU development and clinical teams.

Documentation

  • Updated the language module README and made a few slight adjustments to the overall README.

Changelog

  • Is the CHANGELOG.md updated with your detailed changes? Not yet

kdv123 and others added 30 commits December 12, 2022 16:07
…d of phrase-level averages. Added variation and min/max to timing calculations.
…t does. Added it as an option to the Mixture model
…s. Instead just keep a list of log-probs for each character possibility. No changes in prediction probabilities with a slight decrease in prediction time.
…tswith check since now we force that to be true. Only convert single space characters to SPACE_CHAR, not all whitespace.
… language/main. Improved efficiency of list appends in causal model. Pass current sequence string in the tuple instead of rebuilding each time.
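One commit above mentions keeping a list of log-probs for each character possibility. A plausible reason is that several subword tokens can begin with the same character, so their log-probabilities must be combined per character. The sketch below is illustrative, not the repo's actual code; the example tokens and probabilities are made up:

```python
import math


def logsumexp(log_probs):
    """Numerically stable log(sum(exp(x))) over a list of log-probs."""
    m = max(log_probs)
    return m + math.log(sum(math.exp(x - m) for x in log_probs))


# Hypothetical: each candidate character accumulates the log-probs of
# every subword token that starts with it, then combines them at the end.
char_logprobs = {
    "t": [math.log(0.2), math.log(0.1)],  # e.g. tokens "th" and "to"
    "a": [math.log(0.3)],
}
combined = {c: logsumexp(lps) for c, lps in char_logprobs.items()}
# "t" combines to log(0.2 + 0.1) = log(0.3)
```

Accumulating in log space and combining once per character avoids repeated exponentiation and underflow, which is consistent with the commit's note of a slight decrease in prediction time.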
@lawhead (Collaborator) left a comment


This is a lot of work! Thanks for your effort on this. The main things I would like to revisit would be parameters and maybe the location of symbols. See my comments below.

  • bcipy/display/paradigm/vep/display.py (resolved)
  • bcipy/helpers/parameters.py (outdated, resolved)
  • bcipy/parameters/lm_params.json (resolved)
  • bcipy/language/model/mixture.py (outdated, resolved)
  • bcipy/language/model/causal.py (outdated, resolved)
  • bcipy/language/model/causal.py (outdated, resolved)
@dcgaines dcgaines self-assigned this Mar 20, 2023
@dcgaines dcgaines requested review from tab-cmd and lawhead March 26, 2023 22:05
@tab-cmd (Contributor) left a comment


Some small cleanups are needed, and there is a typo in lm params? Otherwise, this is ready to merge and iterate on! @lawhead will want to take another look

  • .github/workflows/main.yml (outdated, resolved)
  • bcipy/language/model/kenlm.py (outdated, resolved)
  • bcipy/language/tests/test_language.py (outdated, resolved)
  • .github/workflows/main.yml (outdated, resolved)
  • bcipy/language/model/kenlm.py (outdated, resolved)
From bcipy/parameters/lm_params.json:

    "kenlm": {
        "model_file": {
            "description": "Name of the pretrained model file",
            "value": "lm_dec19_char_large_12gram.kenlm",
Contributor:
*.arpa?

@dcgaines (Collaborator, Author) replied:
We've shifted to using the binary .kenlm files instead of the .arpa files, since they have faster model load times. I believe I have already updated all the documentation to point to the proper .kenlm file location.

@dcgaines dcgaines merged commit a740bba into 2.0.0rc3 Mar 30, 2023
4 participants