Is anyone working on 'conditional' logit biasing? (How to potentially improve repetition penalty?) #3149
This seems super, super hard to do. My advice would be to get something you're satisfied with, even if there's a fairly steep performance penalty, and then worry about trying to find an optimized way to accomplish the same thing. Even with rewinding, I'm finding it's still pretty hard to stop the repeated sequences I want to stop and avoid hitting the ones that are fine.

Also, once batch generation stuff is in, I think we can actually avoid the performance penalty most of the time. We can just do batch generation, and supposedly even something like 20 variations doesn't have much of a performance penalty. At that point, once there's a repeated sequence, instead of rewinding and banning the token that started it, we can just choose a variation where it didn't get picked.

Even without that, the performance issue with rewinding doesn't really seem that steep. Maybe 20%? That's definitely something I could live with if I could actually get it working the way I want.
Yeah, I don't blame you for taking the route you did at all, and I might not be able to pull off a worthwhile 'alternative' for the time being just by extrapolating off single starter tokens. I'm considering just doing a different version of

Maybe 'negative bias only gets reinforced if it's part of the next token's pool/when it gets generated' isn't possible in the way I'm expecting though. If it works, then I'll likely try my hand at elaborating on it a bit with my initial idea here, but it's possible I'm overestimating myself hard here... it's pretty intimidating stuff lol

It's nice to hear that batch generation should help with drafting though
I'm not sure I fully understand what you're describing, but if you penalize the start of the sequence then that's a sledgehammer. You're stopping everything that could start with that token. If you do something like progressively increase a penalty as more of a sequence is matched (basically how the seqrep stuff can work in non-rewind mode) then you can basically ban a token and leave the model with no good alternatives. That's kind of what the word-boundary-aware stuff is trying to improve.

Just as an example, if you were looking to stop "the dog's toy is blue" and you let the LLM generate "the dog'" and basically ban "s" at that point, you're pretty much going to get nonsense. There's nothing it can pick after that point except "s" that makes sense. Or even something like the word in your example, "min istr ations". If you get to "ministr" and ban "ations" there are very few possibilities it can choose that even make a real word, and something like "ministry" probably isn't going to fit the context.
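For illustration only, here is a minimal sketch of that kind of progressively increasing sequence penalty (plain C++, not the actual seqrep code; the token ids and the penalty scale are made up): the deeper into a banned token sequence the generated context already is, the harder the next token of that sequence gets pushed down.

```cpp
#include <cstdio>
#include <vector>

// Returns how many tokens at the end of `ctx` match a prefix of `banned`.
static size_t match_len(const std::vector<int> & ctx, const std::vector<int> & banned) {
    size_t best = 0;
    for (size_t len = 1; len < banned.size() && len <= ctx.size(); ++len) {
        bool ok = true;
        for (size_t i = 0; i < len; ++i) {
            if (ctx[ctx.size() - len + i] != banned[i]) { ok = false; break; }
        }
        if (ok) best = len;
    }
    return best;
}

// Penalize the token that would extend the match, proportionally to how deep the match is.
static void apply_seq_penalty(std::vector<float> & logits,
                              const std::vector<int> & ctx,
                              const std::vector<int> & banned,
                              float penalty_per_token) {
    size_t n = match_len(ctx, banned);
    if (n == 0 || n >= banned.size()) return;
    int next_tok = banned[n];
    logits[next_tok] -= penalty_per_token * (float) n;
}

int main() {
    std::vector<float> logits(8, 0.0f);
    std::vector<int> ctx    = {5, 1, 2};   // tail of what has been generated
    std::vector<int> banned = {1, 2, 3};   // banned token sequence
    apply_seq_penalty(logits, ctx, banned, 1.5f);
    printf("logit for token 3 after penalty: %.2f\n", logits[3]);
}
```

This also makes the "no good alternatives" failure mode visible: the penalty peaks exactly when the model is already committed to most of the sequence.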
What I was getting at with the 'rarity' statement was that I thought a partial bias based on how common token parts are could significantly help minimize the risk of high-perplexity outputs, instead of linearly biasing all tokens. The kind you mention, where you force it to complete half of a word it's no longer allowed to create. I was thinking of manually calculating / estimating that 'popularity' based on a dictionary or some other resource if necessary, and then you could store percentage estimates of the most sensitive parts for the token pieces.

The 'blacklist' vs 'whitelist' style behavior, where it analyzes the top probable tokens, is also something I still wanna do if it's reasonable. There might be way more overhead than I'm expecting in checking and changing logit probability on the fly like that though. I was planning on referencing the code for the repetition penalty to see how it updates probability, but I could be wrongly assuming that biasing logits during generation would be higher performance compared to your rewind drafting.

Also, I'm not too familiar with seqrep; if there are obvious mistakes I'm making that I'm still ignoring when it comes to this, lmk. Overall I'm still learning and throwing ideas at the wall atm, but I appreciate the feedback
Alright, so here's my gameplan for this:
If that's impossible, the alternative is gauging the last X tokens generated and banning if there are other context clues; which I imagine would be less reliable, but it'd be better than not being able to bias sequences whatsoever...
By itself, this won't be very valuable. For example, when banning But then I will replace it with a renamed option: So, for example:
Maybe it could be set up so that it biases proportional to how many synonyms are seen (e.g., 0% negative bias without
Yes, that's basically how it works. You evaluate the model for a step and you get the logits back. "Logits" are basically a big array of floating point numbers, one for each token id the model understands, where a higher value means more likely. One thing to note, though, is that while the values mean something in a relative sense, they're kind of arbitrary and may vary between models. That's what stuff like softmax is for - it scales them to a more predictable value. For example, a sampler like top-k works by sorting that list and keeping the top K items.

Various samplers like biases, top-k, etc. run after evaluation, and then using some method you pick an actual token. Then the next time you evaluate a step of the model you feed that token id you chose into it and get the next set of logits. This process repeats.

So the "manipulating what token gets chosen" part really isn't hard. The big problem is making a decision about what to do when you only have the tokens that have been generated so far. Also, regardless of how you scale logits or whatever - the end result for a particular logit is binary: it gets picked or it doesn't.
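To make that loop concrete, here is a minimal, self-contained sketch (plain C++; `fake_eval`, `pick_greedy`, and all the numbers are stand-ins for illustration, not real llama.cpp API): evaluate, apply samplers/biases to the logits, pick a token, feed it back in.

```cpp
#include <cstdio>
#include <map>
#include <vector>

static std::vector<float> fake_eval(const std::vector<int> & /*tokens*/) {
    // A real backend returns one logit per vocabulary entry here.
    return {0.1f, 2.0f, 0.5f, 1.7f};
}

static int pick_greedy(const std::vector<float> & logits) {
    int best = 0;
    for (int i = 1; i < (int) logits.size(); ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

int main() {
    std::vector<int> ctx = {0};                 // prompt tokens
    std::map<int, float> bias = {{1, -10.0f}};  // user-supplied logit biases

    for (int step = 0; step < 4; ++step) {
        std::vector<float> logits = fake_eval(ctx);
        // Samplers (biases, top-k, temperature, ...) run here, after evaluation.
        for (const auto & kv : bias) logits[kv.first] += kv.second;
        int tok = pick_greedy(logits);
        printf("step %d -> token %d\n", step, tok);
        ctx.push_back(tok);                     // fed back in on the next evaluation
    }
}
```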
Well, I know you can see the evaluation after generation; my concern was whether or not you could evaluate during generation and change it before those probabilities are picked from. I guess that's true, and it's the sampler's job (which is how Mirostat's top_k can be changed on the fly), so that's not a concern. And yes, the end result is binary for each token.

The proportional bias I mention at the end would just be deciding how much to bias against the token (globally?) based on what other tokens are seen in the logit list. The percentage would be a grade of how likely it is that this token is being used for a longer sequence that the user doesn't want; and so, due to that uncertainty, it should scale based on how likely it is 'min' will become 'ministrations' instead of 'minimalist'.

So what you could do is have a 'bias generation script' of some kind, which will parse the differences between the logit list of 'min' in the context of "The nurse's..." compared to the context of "His room wasn't complex and was..." and figure out which tokens are exceedingly probable for 'ministrations' generations compared to 'minimalist'. That way you can bias less universally and only contextually.

Are you understanding what I'm getting at now? If the custom sampling isn't that hard to implement, it could be a very cost-effective way to bias against longer words based on context clues without regeneration.
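As a rough illustration of that 'bias generation script' idea (everything here is an assumption: the hard-coded logit arrays stand in for real model output and the scaling rule is arbitrary), one could compare the probability of the suspect continuation token in a "bad" context versus a "neutral" one and scale the negative bias by how over-represented it is:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static std::vector<float> softmax(const std::vector<float> & logits) {
    float mx = logits[0];
    for (float v : logits) if (v > mx) mx = v;
    float sum = 0.0f;
    std::vector<float> p(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) { p[i] = std::exp(logits[i] - mx); sum += p[i]; }
    for (float & v : p) v /= sum;
    return p;
}

int main() {
    // Pretend logits after "The nurse's ... min" (bad context) vs "His room was ... min" (neutral).
    std::vector<float> bad_ctx     = {0.2f, 3.1f, 0.4f, 0.9f};
    std::vector<float> neutral_ctx = {0.3f, 0.8f, 1.9f, 1.0f};
    int suspect_tok = 1;  // e.g. the piece that tends to lead into "ministrations"

    float p_bad     = softmax(bad_ctx)[suspect_tok];
    float p_neutral = softmax(neutral_ctx)[suspect_tok];

    // The more over-represented the token is in the bad context, the stronger the bias.
    float ratio = p_bad / (p_neutral + 1e-6f);
    float bias  = -std::min(10.0f, std::log(ratio + 1e-6f) * 2.0f);
    printf("p(bad)=%.3f p(neutral)=%.3f -> bias %.2f\n", p_bad, p_neutral, bias);
}
```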
Not really (not to say it's 100% impossible but there's nothing like that currently and it's generally a hard problem).
Yes, basically. You influence the logits you get when you evaluate the model with the token you picked on the previous step (or the prompt if you haven't had the model evaluate any of its own tokens yet).
It's going to be model specific and language specific and very likely also context specific as well. It seems like it's going to require so much effort constructing the rules that you might as well just write the thing yourself at that point. There are just so many words and permutations I just don't see how one could cover enough of them.
I actually don't really understand why someone would want to do that. My own efforts are aimed at trying to encourage less repetitive and more creative/diverse output. What's the use case for biasing against longer words?
See the image attached for an explanation. It's not something you need ML for, and yes, it would be complex for a person to do by hand; so why not make a script that builds the best whitelist/blacklist for you, and then weighs proportionally? I plan to make a script to do that on my own, and then I'll attempt the implementation myself; if you're not interested, that's fine, as you already have enough on your plate. But if you think I shouldn't attempt this idea because something else is preventing it from working, do tell; otherwise it seems like this should be doable if I focus on it independently.
It's not about longer words. It's about being able to bias any word or short sequence (or bias positively, which I might explore later) in a way that is contextually aware. The potential use cases for that are very powerful for controlling how llama behaves outside of 'prompt engineering'. So it's not exclusively a 'better' repetition penalty.

Also, you can already guide sampling to fit arbitrary grammar rules for things like JSON output. I wonder if I should look into that PR and what it's doing for a better understanding of how I could improve model guidance...
Try it. :) I feel like this is going to be very difficult but certainly it would be great if you can prove me wrong.
No, no, I would never say anything like that. Even if you don't succeed, you'll probably still learn a lot by trying.
If you can find a practical way to do it, I agree, that's very useful. I'd really suggest not worrying too much about the runtime performance. Like, if you could find a way to do that even with rewinding, where you will have much, much more information available, it would still be great. Just as an example, the Classifier Free Guidance stuff halves performance and people still think it's worthwhile. So you could rewind to the extent that you halve performance and it still could be useful. In reality, it's not likely you'd need to rewind anything close to that much.
If I remember correctly, it basically works by banning all tokens that don't conform to the grammar. It doesn't really guide the model so much as prevent generation of anything that won't fit.
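In toy form, that "ban everything that doesn't fit" behaviour looks roughly like this (illustrative only; which tokens count as allowed would really come from the grammar state, not a hand-written list):

```cpp
#include <cstdio>
#include <limits>
#include <vector>

int main() {
    std::vector<float> logits = {1.2f, 0.3f, 2.4f, 0.8f};
    // Suppose only tokens 0 and 3 can legally continue the current grammar state.
    std::vector<bool> allowed = {true, false, false, true};

    // Non-conforming tokens are effectively banned by sending their logits to -inf.
    for (size_t i = 0; i < logits.size(); ++i) {
        if (!allowed[i]) logits[i] = -std::numeric_limits<float>::infinity();
    }
    for (size_t i = 0; i < logits.size(); ++i) {
        printf("token %zu: %f\n", i, logits[i]);
    }
}
```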
I'd like to comment on this myself if that's no problem. First, I'd like to know if this is going well or how much work has been done on this issue.

Secondly, I personally dislike the idea of having a static list of synonyms for alternative words. Like Kerfuffle mentioned, this will ultimately make the list language specific, so it would need to be done for all languages. Additionally, this does not cover cases where you're not biasing against whole words or phrases. If I wanted to bias against

Instead, I would propose to use the earlier mentioned idea of batched generation. You would introduce new settings/arguments, something like

My method (and I have absolutely NO IDEA whether this is performant, or memory intensive, or anything, or if it's even practical, just throwing it out there) would consider a certain number of previous tokens as well in evaluating the logit bias. This would eliminate the awkward moments where the LLM generates text leading up to its use of a certain word, only for it to then be told "Uh well, pick a different one.", only for there to be no suitable alternative.

Consider the text generation examples from earlier, in particular This phrase including

We finished running our 13th generation, and we still have 10 iterations of 20 batches in memory, what now? Well, because we only kept 10 iterations, the tokens
With the previously thought of method of logit biasing (as far as I understand how it would make the most sense), if the first option was ever generated, any other alternative for

However, with this historical knowledge, we can instead pick a different leading phrase and avoid this awkward situation altogether. We might instead go for the other phrase, which avoids any lead-up to the word

How the score itself is generated, or how the bias would work mathematically, I don't know.
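A very rough sketch of that batched-history idea (all of this is assumption: the struct, the window size, and the "fall back to a clean candidate" rule are made up for illustration) might keep the last N tokens of each batch member and check which candidates never led up to the banned word:

```cpp
#include <cstdio>
#include <deque>
#include <vector>

struct Candidate {
    std::deque<int> history;   // last N tokens kept for this batch member
};

// True if the banned token sequence appears anywhere in the candidate's recent history.
static bool contains_banned(const Candidate & c, const std::vector<int> & banned) {
    if (c.history.size() < banned.size()) return false;
    for (size_t start = 0; start + banned.size() <= c.history.size(); ++start) {
        bool ok = true;
        for (size_t i = 0; i < banned.size(); ++i) {
            if (c.history[start + i] != banned[i]) { ok = false; break; }
        }
        if (ok) return true;
    }
    return false;
}

int main() {
    std::vector<int> banned = {7, 8};   // token pieces of the unwanted word
    std::vector<Candidate> batch = {
        {{1, 2, 7, 8}},                 // this candidate led into the banned word
        {{1, 2, 3, 4}},                 // this one is a clean alternative to fall back to
    };
    for (size_t i = 0; i < batch.size(); ++i) {
        printf("candidate %zu banned? %s\n", i, contains_banned(batch[i], banned) ? "yes" : "no");
    }
}
```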
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Idea
If we want to efficiently bias against (or outright 'ban') specific words that are made up of multiple tokens, as well as short phrases, checking the logit list to see whether the other predictions imply the 'full word' or 'full phrase' could be very beneficial. Currently, the model only predicts a single token at a time; deciding whether or not to pick a token based on context clues in the logit list (e.g. short synonyms versus the first piece of a larger word) would therefore be valuable, since it would avoid any overhead from 'rewinding' or reprocessing context.
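As a hedged illustration of that "context clues in the logit list" check (the token strings, probabilities, and threshold below are all invented for the example), one might only down-weight the first piece of a multi-token banned word when reasonable short alternatives are actually present among the top candidates:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Cand { std::string piece; float prob; };

int main() {
    // Top of a hypothetical (already softmaxed) candidate list after "the nurse's ...".
    std::vector<Cand> top = {
        {"min",   0.41f},   // could start the banned word "ministrations"
        {"care",  0.22f},   // complete short synonym
        {"help",  0.17f},
        {"touch", 0.09f},
    };

    const std::string banned_first_piece = "min";
    float alt_mass = 0.0f;
    for (const Cand & c : top) {
        if (c.piece != banned_first_piece) alt_mass += c.prob;
    }

    // Only bias against the first piece if good alternatives exist; otherwise
    // leave it alone rather than force the model into nonsense.
    if (alt_mass > 0.30f) {
        printf("bias against '%s' (alternative mass %.2f)\n", banned_first_piece.c_str(), alt_mass);
    } else {
        printf("no good alternatives, leave '%s' unbiased\n", banned_first_piece.c_str());
    }
}
```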
A related draft PR exists which is dedicated to implementing a 'rewind' feature for a sequence repetition penalty option. This could be very beneficial for longer phrases that can't be accurately 'predicted' ahead of time:
#2593
But I don't see any PR that attempts to tackle the issue in a way that doesn't incur performance overheads of some kind from having to regenerate tokens.
I have visually drafted out this conditional biasing concept in hopes that anyone working on a similar feature might be willing to help on this idea.
In addition, you could theoretically implement this in such a way that if you are biasing against a continued phrase or sentence, you gradually increase the bias for each consecutive word. For example, let's say you want to prevent this sentence from being referenced in any way:

"The quick brown fox jumps over the lazy dog"
Individually, these could still be considered typical tokens; the bias would only be introduced if a repeated sequence order is seen based on the frequency of those words.
"The" by itself shouldn't be impacted for obvious reasons; but a small bias against 'quick' could be introduced if the word preceding it was 'The'. For 'brown', you could bias the probability more aggressively and so on.
For every token that is breaking out of the 'banned sequence', you could ease off the biasing until it returns back to zero.
Doing this by hand would be tedious; maybe an automatic calculation that judges the rarest portions of the 'banned phrases' and weighs them proportionally to the rest of the temperature would be a better move for a 'phrase ban list'?
In addition, the sequence doesn't necessarily have to be followed exactly in order to trigger the 'ban', as you could penalize more generic sub-phrases like 'jumps over the' proportionally less than others. 'quick brown fox' might have a stronger negative bias, for example.
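A tiny sketch of that gradual phrase bias, word-level and with arbitrary penalty values purely to illustrate the idea (a real implementation would work on token ids and could ease the bias off gradually rather than reset it instantly):

```cpp
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> banned = {"The", "quick", "brown", "fox"};
    std::vector<std::string> output = {"The", "quick", "red", "fox"};  // what was generated

    size_t matched = 0;          // how far into the banned phrase we currently are
    for (const std::string & word : output) {
        // Bias applied *before* this word is picked, proportional to the match so far:
        // "The" alone gets none, "quick" after "The" a small one, "brown" after "The quick" more.
        float bias = -1.0f * (float) matched;
        printf("next word '%s' sampled with bias %.1f on '%s'\n",
               word.c_str(), bias,
               matched < banned.size() ? banned[matched].c_str() : "-");
        if (matched < banned.size() && word == banned[matched]) {
            ++matched;           // phrase continues, bias on the next word grows
        } else {
            matched = 0;         // broke out of the phrase, drop the bias back to zero
        }
    }
}
```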