
Language Model Redesign #268

Merged
merged 282 commits into from
Mar 30, 2023

Conversation

@dcgaines (Collaborator) commented Mar 13, 2023

Overview

This PR reworks the existing GPT-2 language model implementation to fix several critical bugs. The new CausalLanguageModel class fixes these bugs and supports any causal model from HuggingFace (or a locally trained one). This PR also adds the KenLMLanguageModel class, which implements an n-gram model, and the MixtureLanguageModel class, which combines the outputs of two or more other models.

Ticket

https://www.pivotaltracker.com/story/show/183975978
https://www.pivotaltracker.com/story/show/184017969
https://www.pivotaltracker.com/story/show/184589162
https://www.pivotaltracker.com/story/show/184365440

Contributions

  • lm_eval.py script for the evaluation of perplexity and prediction time for any given language model
  • mixture_tuning.py script for the optimization of mixture weights for use in the MixtureLanguageModel
  • CausalLanguageModel class, a generic language model class that can be used with many different causal models from HuggingFace
  • KenLMLanguageModel class, an n-gram language model class
  • MixtureLanguageModel class, which allows for the mixture of the outputs of two or more other models
  • Creation of lm_params.json to accommodate the usage of the new language model classes. This file is copied into the session data directory alongside parameters.json
  • Deprecated old GPT-2 implementation due to critical bugs. It is replaced by the CausalLanguageModel class, which can use gpt2 from HuggingFace
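Since CausalLanguageModel, KenLMLanguageModel, and MixtureLanguageModel are meant to be interchangeable, they presumably share a common interface. The sketch below is a hypothetical version of that pattern; the class and method names are assumptions, not BciPy's actual API, and the toy uniform model stands in for a real predictor:

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class LanguageModel(ABC):
    """Hypothetical base class; the real BciPy interface may differ."""

    @abstractmethod
    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        """Return (symbol, probability) pairs for the next character."""


class UniformLanguageModel(LanguageModel):
    """Toy stand-in model: every symbol is equally likely."""

    def __init__(self, symbols: List[str]):
        self.symbols = symbols

    def predict(self, evidence: List[str]) -> List[Tuple[str, float]]:
        p = 1.0 / len(self.symbols)
        return [(s, p) for s in self.symbols]


lm = UniformLanguageModel(["a", "b", "c", "_"])
preds = lm.predict(list("he"))  # each of the four symbols gets 0.25
```

A shared abstract base class is what lets evaluation scripts like lm_eval.py run unchanged against any of the model classes.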

Test

  • Wrote test suites for each language model class based on the pre-existing tests for the previous GPT2LanguageModel class. Expanded tests where necessary to ensure proper performance.
  • Tested locally using a fake LSL server
  • Tested on new home use machine by OHSU development and clinical teams.

Documentation

  • Updated the language module README and made a few slight adjustments to the overall README.

Changelog

  • Is the CHANGELOG.md updated with your detailed changes? Not yet

kdv123 and others added 30 commits December 12, 2022 16:07
…d of phrase-level averages. Added variation and min/max to timing calculations.
…t does. Added it as an option to the Mixture model
…s. Instead just keep a list of log-probs for each character possibility. No changes in prediction probabilities with a slight decrease in prediction time.
…tswith check since now we force that to be true. Only convert single space characters to SPACE_CHAR, not all whitespace.
… language/main. Improved efficiency of list appends in causal model. Pass current sequence string in the tuple instead of rebuilding each time.
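One commit above mentions keeping a list of log-probs for each character possibility. A plausible reason is that several subword tokens can begin with the same character, so their log-probabilities must be combined per character. The sketch below is illustrative, not the repo's actual code; the example tokens and probabilities are made up:

```python
import math


def logsumexp(log_probs):
    """Numerically stable log(sum(exp(x))) over a list of log-probs."""
    m = max(log_probs)
    return m + math.log(sum(math.exp(x - m) for x in log_probs))


# Hypothetical: each candidate character accumulates the log-probs of
# every subword token that starts with it, then combines them at the end.
char_logprobs = {
    "t": [math.log(0.2), math.log(0.1)],  # e.g. tokens "th" and "to"
    "a": [math.log(0.3)],
}
combined = {c: logsumexp(lps) for c, lps in char_logprobs.items()}
# "t" combines to log(0.2 + 0.1) = log(0.3)
```

Accumulating in log space and combining once per character avoids repeated exponentiation and underflow, which is consistent with the commit's note of a slight decrease in prediction time.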
@lawhead (Collaborator) left a comment


This is a lot of work! Thanks for your effort on this. The main things I would like to revisit would be parameters and maybe the location of symbols. See my comments below.

  • bcipy/display/paradigm/vep/display.py (resolved)
  • bcipy/helpers/parameters.py (outdated, resolved)
  • bcipy/parameters/lm_params.json (resolved)
  • bcipy/language/model/mixture.py (outdated, resolved)
  • bcipy/language/model/causal.py (outdated, resolved)
  • bcipy/language/model/causal.py (outdated, resolved)
@dcgaines dcgaines self-assigned this Mar 20, 2023
@dcgaines dcgaines requested review from tab-cmd and lawhead March 26, 2023 22:05
@tab-cmd (Contributor) left a comment


Some small cleanups are needed, and there is a typo in lm params? Otherwise, this is ready to merge and iterate on! @lawhead will want to take another look

  • .github/workflows/main.yml (outdated, resolved)
  • bcipy/language/model/kenlm.py (outdated, resolved)
  • bcipy/language/tests/test_language.py (outdated, resolved)
  • .github/workflows/main.yml (outdated, resolved)
  • bcipy/language/model/kenlm.py (outdated, resolved)
From bcipy/parameters/lm_params.json:

    "kenlm": {
        "model_file": {
            "description": "Name of the pretrained model file",
            "value": "lm_dec19_char_large_12gram.kenlm",
Contributor:
*.arpa?

@dcgaines (Collaborator, Author) replied:
We've shifted to using the binary .kenlm files instead of the .arpa files, since they have faster model load times. I believe I have already updated all the documentation to point to the proper .kenlm file location.

@dcgaines dcgaines merged commit a740bba into 2.0.0rc3 Mar 30, 2023
4 participants