Improving GPT2 language model prediction #207
Conversation
I would like to see some additional unit tests added to demonstrate the improvements. For instance:
- Fixed the multiple-spaces bug: you could test that the probability returned for the space character after a space has been typed is some small value (or at least smaller than the probability before the space was typed; ex. "THE" vs. "THE_").
- All letters in the alphabet (except for the backspace character) now have nonzero probability after interpolating with a unigram language model: call predict() for a word that previously returned 0 values and assert that all values are greater than 0.
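A sketch of what those two tests might look like. Since constructing the real GPT2 model is heavyweight, a stub stands in for GPT2LanguageModel here; the predict() return shape (a list of symbol/probability tuples) follows the PR, and everything else in the stub is an assumption:

```python
from string import ascii_uppercase

SPACE_CHAR = "_"  # assumed space symbol

class StubLM:
    """Stand-in with the same predict() return shape as GPT2LanguageModel."""
    def predict(self, evidence):
        typed = "".join(evidence)
        # after a space, the space symbol should get a tiny probability
        space_p = 0.001 if typed.endswith(SPACE_CHAR) else 0.3
        letter_p = (1.0 - space_p) / len(ascii_uppercase)
        probs = {c: letter_p for c in ascii_uppercase}
        probs[SPACE_CHAR] = space_p
        return list(probs.items())

model = StubLM()

# test 1: P(space) drops once a space has been typed ("THE" vs "THE_")
before = dict(model.predict(list("THE")))[SPACE_CHAR]
after = dict(model.predict(list("THE_")))[SPACE_CHAR]
assert after < before

# test 2: after unigram smoothing, every letter has nonzero probability
preds = dict(model.predict(list("THE_")))
assert all(preds[c] > 0.0 for c in ascii_uppercase)
```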
bcipy/language/model/gpt2.py
Outdated
                          'X': 0.0008, 'Z': 0.0005, 'Q': 0.0002, BACKSPACE_CHAR: 0.0}

        # A uniform language model
        self.uniform_lm = dict(zip(self.symbol_set, equally_probable(self.symbol_set, {BACKSPACE_CHAR: 0.0})))
I don't see where you're actually using self.uniform_lm anywhere. Maybe this is for debugging and can be removed? I also don't see any uses of the is_start_of_word attribute.
Yes, I have removed unused attributes.
bcipy/language/model/gpt2.py
Outdated
-    def predict(self, evidence: List[str]) -> List[Tuple]:
+    def predict(self, evidence: List[str], beam_width: int = 20, search_depth: int = 2) -> List[Tuple]:
These additional parameters violate the API. They should be class attributes.
Okay. I changed them to class attributes.
bcipy/language/model/gpt2.py
Outdated
-    def __model_infer(self, text: str) -> List[float]:
+    def __rescale(self, lm: Dict[str, float], coeff: float):
Could this be a utility function in this module? It seems like it might be more generally useful.
I agree. Changed it to a static method.
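For reference, a minimal sketch of what the rescale utility might look like as a module-level or static function; the signature follows the diff above, but the body is an assumption:

```python
from typing import Dict

def rescale(lm: Dict[str, float], coeff: float) -> Dict[str, float]:
    """Scale every probability in a distribution by coeff, e.g. to weight
    one model's contribution before mixing it with another."""
    return {symbol: prob * coeff for symbol, prob in lm.items()}

scaled = rescale({"A": 0.5, "B": 0.5}, 0.8)
```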
bcipy/language/model/gpt2.py
Outdated
@@ -81,84 +182,122 @@ def __get_char_predictions(self, word_prefix: str) -> List[tuple]:

         return char_prob_tuples

-    def __build_vocab(self) -> Dict[int, str]:
+    def __interpolate_language_models(self, lm1: Dict[str, float], lm2: Dict[str, float], coeff: float) -> List[Tuple]:
This seems like it could be a utility function as well.
Also changed it to a static method.
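A hedged sketch of the interpolation utility; the signature follows the diff above, and the linear-mixture body and sort order are assumptions based on the discussion:

```python
from typing import Dict, List, Tuple

def interpolate_language_models(lm1: Dict[str, float],
                                lm2: Dict[str, float],
                                coeff: float) -> List[Tuple[str, float]]:
    """Mix two distributions over the same symbol set as
    coeff * lm1 + (1 - coeff) * lm2, sorted most-probable first."""
    combined = {s: coeff * lm1[s] + (1.0 - coeff) * lm2[s] for s in lm1}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

mixed = interpolate_language_models({"A": 1.0, "B": 0.0},
                                    {"A": 0.0, "B": 1.0}, 0.8)
```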
bcipy/language/model/gpt2.py
Outdated
            # sort the new candidates based on likelihood and populate the beam
            ordered_candidates = sorted(new_candidates, key=lambda x: x[1], reverse=True)
            beam = ordered_candidates[:beam_width]
It might be useful to factor out the beam search algorithm from the specifics of how we're using it. This should be generalizable.
Sure, I moved the beam search code to another method, self.__beam_search().
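A generalized version of the factored-out search might look like the following; the method name __beam_search comes from the comment above, but this standalone signature and the expand callback are assumptions:

```python
from typing import Callable, List, Tuple

def beam_search(start: str,
                expand: Callable[[str], List[Tuple[str, float]]],
                beam_width: int,
                search_depth: int) -> List[Tuple[str, float]]:
    """Generic beam search: expand maps a hypothesis to (extension, log-prob
    increment) candidates; keep the top beam_width hypotheses at each depth."""
    beam = [(start, 0.0)]
    for _ in range(search_depth):
        new_candidates = [(hyp + ext, score + delta)
                          for hyp, score in beam
                          for ext, delta in expand(hyp)]
        # sort the new candidates based on likelihood and populate the beam
        beam = sorted(new_candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beam

# toy expansion: every hypothesis can grow by "A" (more likely) or "B"
beam = beam_search("", lambda hyp: [("A", -1.0), ("B", -2.0)],
                   beam_width=2, search_depth=2)
```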
@lawhead, I added two unit tests that address your comments.
Thanks for making the suggested changes. The predictions seem better than before, but I'm still surprised by how high the SPACE character is ranked when no letters have been typed (model.predict([])) and after a SPACE has just been typed (model.predict(list("THE_"))). This may be worth looking into further and writing some unit tests against.
        # Hard coding a unigram language model trained on ALS phrase dataset
        # for smoothing purpose
        self.unigram_lm = {'E': 0.0998, 'T': 0.096, 'O': 0.0946, 'I': 0.0835,
Would this fix our symbol set? That may be alright given how we want to use it in the short term, but we should validate that the symbol set passed in matches the one you define here.
This serves as a quick way to introduce a unigram language model for interpolation purposes. It uses the same symbol set as the default symbol set, alphabet(). I think for now this should be alright. If the symbol set were a parameter that could change, we would probably need to train a unigram model on that symbol set on the fly, which may be undesirable. For now I have added a check to make sure that the unigram symbol set is the same as the one passed in.
                          'D': 0.0358, 'Y': 0.0324, 'W': 0.0288, 'M': 0.0266,
                          'G': 0.0221, 'C': 0.018, 'K': 0.016, 'P': 0.0145,
                          'F': 0.0117, 'B': 0.0113, 'V': 0.0091, 'J': 0.0016,
                          'X': 0.0008, 'Z': 0.0005, 'Q': 0.0002, BACKSPACE_CHAR: 0.0}
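The symbol-set consistency check discussed above could be sketched like this; validate_symbol_set, the placeholder weights, and the backspace constant are all hypothetical stand-ins for the real bcipy definitions:

```python
from string import ascii_uppercase
from typing import Dict, List

BACKSPACE_CHAR = "<"  # placeholder; the real constant lives in bcipy

# placeholder unigram weights standing in for the hard-coded ALS-phrase model
unigram_lm: Dict[str, float] = {c: 1.0 / 26 for c in ascii_uppercase}
unigram_lm[BACKSPACE_CHAR] = 0.0

def validate_symbol_set(symbol_set: List[str], lm: Dict[str, float]) -> None:
    """Raise if the configured symbol set differs from the unigram LM's keys."""
    if set(symbol_set) != set(lm):
        raise ValueError("symbol set does not match the unigram language model")

# the default alphabet() symbol set matches, so this passes silently
validate_symbol_set(list(ascii_uppercase) + [BACKSPACE_CHAR], unigram_lm)
```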
I'm not sure about the impact of setting a zero probability on BACKSPACE_CHAR?
Currently the language model sets zero probability on BACKSPACE_CHAR, but the copy phrase code would add a nonzero probability to it (I think it is 0.05). I think we can leave it this way for now.
        self.lm_path = lm_path or "gpt2"

        # Hard coding a unigram language model trained on ALS phrase dataset
If this is specific to the population or phrases used, it would be better to load these weights or smoothing parameters in some other way. This overfits the model to the current experiment phrases. How would this change if we used a new dataset? What is the training procedure?
I think we can treat this the same as loading a pretrained GPT2 language model -- we could certainly put the weights in some other file and load them from there, but we would still need to train offline and update that file, which is not much different from what we are doing now. For now I think it's okay to keep it as it is.
bcipy/language/model/gpt2.py
Outdated
        # interpolate with unigram language model to smooth the probability distribution returned
        # by GPT2 language model
        next_char_pred = GPT2LanguageModel.interpolate_language_models(dict(next_char_pred), self.unigram_lm, 0.8)
Bring the hardcoded coefficients up to __init__ for easier configuration.
Yes, I have done it.
Overview
A new beam-search based method for GPT2 character prediction. The language model can predict multiple wordpieces and marginalize the word-level predictions to make character-level predictions.
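The marginalization step described here can be sketched as follows: sum the probability of each beam hypothesis onto its first character, then normalize. The function name and input shape are assumptions, not the PR's actual implementation:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def marginalize(hypotheses: List[Tuple[str, float]]) -> Dict[str, float]:
    """Collapse word-level (continuation, probability) hypotheses into a
    normalized distribution over the next character."""
    char_probs: Dict[str, float] = defaultdict(float)
    for text, prob in hypotheses:
        if text:
            char_probs[text[0]] += prob
    total = sum(char_probs.values())
    return {char: prob / total for char, prob in char_probs.items()}

# three toy wordpiece continuations from a beam
next_char = marginalize([("AN", 0.5), ("AT", 0.3), ("BE", 0.2)])
```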
Ticket
https://www.pivotaltracker.com/story/show/181349709
Contributions
Test