Implementation of a sequence repetition penalty sampler #2593
base: master
Conversation
Force-pushed from ab27956 to cb02274
Here is the current command-line help to give an idea of usage. The settings are also described in somewhat more detail elsewhere.
I tested that the emulation mode examples actually produce the same results as the original samplers, with one difference in how seqrep behaves.
Force-pushed from 17eaa96 to 3ca1717
I'll look into this later, but most likely it will be merged after #2398
Force-pushed from 84ee695 to 0875559
Going to set this as a draft for the moment since I still need to double-check that everything still works after rebasing for the GGUF stuff. Feedback is still very welcome.
Force-pushed from a8fd7d6 to 33d6ce0
Current status

Updated for the recent changes. I also added some tests; they cover the repetition and frequency/presence penalty emulation and the basic seqrep functionality, though there are a lot of possible options that aren't covered yet. I'm not sure I'd say it's quite ready to merge, since I'd really like to get some kind of feedback first. I do think it's ready to review and test, and it seems to improve output - at least for some types of content; I wouldn't use any presence/repetition-type penalties for something like code generation. The sampler is pretty large and complex, but it's completely opt-in and the options don't even clog up the command-line help.

Commandline Help Example

Here's a copy of the current command-line help.
The Pitch

Emulating Existing Penalty Samplers

It's possible to specify the settings so that seqrep emulates the existing repetition, frequency and presence penalty samplers.

Seqrep Mode

The full seqrep mode can be used to discourage repeating sequences that already exist in the prompt/generated tokens. Let's say we have some recent tokens and the minimum match length set to 3.
If generating a particular token next would recreate a sequence of at least that length that appeared earlier, we'd want to penalize that token and (hopefully) encourage more varied output. The sampler also has an option that lets sequences which only approximately match an earlier one count as a repeat.
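As a concrete stand-in (an invented example, not the one from the original description): suppose the context so far is 1 2 4 7 8 1 2 and the minimum match length is 3. Generating 4 next would recreate the earlier 1 2 4, so 4 should be penalized. A rough C++ sketch of that exact-match check follows; the function and parameter names are made up for illustration and are not the PR's actual API.

```cpp
#include <cstddef>
#include <vector>

// If appending a token would complete a sequence of at least `min_length` tokens
// that already occurred earlier in `last_tokens`, subtract `penalty` from that
// token's logit. Simple O(n * min_length) scan - fine for a sketch, not for production.
void penalize_repeated_sequences(const std::vector<int> & last_tokens,
                                 std::vector<float>     & logits,
                                 std::size_t              min_length,
                                 float                    penalty) {
    const std::size_t n = last_tokens.size();
    if (min_length < 2 || n < min_length) {
        return;
    }
    // The current tail: the last (min_length - 1) tokens of the context.
    const std::size_t tail = n - (min_length - 1);
    for (std::size_t start = 0; start + min_length <= n; ++start) {
        bool match = true;
        for (std::size_t i = 0; i + 1 < min_length; ++i) {
            if (last_tokens[start + i] != last_tokens[tail + i]) {
                match = false;
                break;
            }
        }
        if (match) {
            // The token that continued this sequence last time; penalize it now.
            const int continuation = last_tokens[start + min_length - 1];
            if (continuation >= 0 && (std::size_t) continuation < logits.size()) {
                logits[continuation] -= penalty;
            }
        }
    }
}
```

With the stand-in context above, the only earlier match is at the start of the buffer, so only the logit for token 4 gets reduced.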
Word Boundary Awareness

The sampler also accepts a floating point mid-word scale value. It has logic to try to identify whether the token that would be penalized sits at a word boundary; when the token appears to be in the middle of a word, the penalty is multiplied by that scale instead of being applied in full.

Aside from the seqrep stuff, I think this is an interesting idea to explore for other types of samplers - temperature, for example. Applying more randomness to the start of words rather than to syllables in the middle of a word could be worth experimenting with. (A rough sketch of the mid-word scaling idea follows.)
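Returning to the mid-word scale: here is a rough illustration of how such a scale could be applied. The boundary heuristic, function name and parameters are assumptions made for this sketch, not the PR's actual word-boundary logic (which handles more cases, e.g. punctuation).

```cpp
#include <string>

// Scale the penalty down when the candidate token would continue a word that is
// already in progress rather than start a new one. With LLaMA-style tokens, a
// new word usually begins with a leading space.
float scaled_penalty(const std::string & prev_piece,      // text of the last generated token
                     const std::string & candidate_piece, // text of the token being penalized
                     float               penalty,
                     float               mid_word_scale)  // e.g. 0.1 = 10% of the penalty mid-word
{
    const bool prev_ends_word   = !prev_piece.empty() &&
        (prev_piece.back() == ' ' || prev_piece.back() == '\n');
    const bool cand_starts_word = !candidate_piece.empty() &&
        (candidate_piece.front() == ' ' || candidate_piece.front() == '\n');

    const bool mid_word = !prev_ends_word && !cand_starts_word;
    return mid_word ? penalty * mid_word_scale : penalty;
}
```

For example: after generating "fo", a candidate "x" neither follows a word-ending token nor starts with a space, so it would only receive the scaled-down penalty.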
Let's try to gather some feedback and see how useful the change is.
An alternative option that I would not have a problem merging is to implement this as a basic example instead of part of llama.cpp. For example, it can be part of common - directly in common/common.cpp, or a new common/experiments.h/.cpp or common/staging.h/.cpp - and enable it in main. If the approach proves useful and finds adoption, we can merge it into llama.cpp and maintain it long-term.
Excuse me if it's inappropriate to chime in. I'm more of an end user, and I noticed that the LLM (in my case nous-hermes-llama2-13b.ggmlv3.q4_K_M) is oddly inclined to repeat, literally word for word, a previously LLM-generated summary included as 'context info' in the prompt, while completely ignoring the main text it was instructed to summarize. My intuition is that the LLM is overly sensitive to its own style of text, in the sense that it triggers a very high signal, mathematically speaking. Even more so because the original text I'm trying to summarize is an audio-to-text transcription of a spoken interview, which is obviously less structured and more chaotic. This PR seems to tackle this issue, right?
Since I'm interested in this kind of summarization chain (where a preceding LLM-generated summary is included in the prompt along with the new text to be summarized), I'm interested in running tests with this proposed solution. While I'm attempting a go at it, what's a good starting point in terms of parameter combinations for this use case?
100% the opposite, this kind of comment is exactly the kind of thing I'm looking for!
Yes, pretty much that's exactly the idea. I haven't tried anything with summaries, but I sometimes like to mess around with having the LLM generate creative stuff. One thing that really annoyed me was how it would just repeat sections from the prompt or what it had previously generated pretty much verbatim.
That's a hard one to answer because even for my own use case I'm not really sure what combination is best. I haven't tried summarizing. I can give you some places to start, but I'm not sure how they'll work for your use case. One thing you could also try (potentially using the examples as a base) is to change the values to an extreme and see how that affects the output. That will help you get a feel for how the settings influence generation.
Pretty simple one to get started. This will only lightly penalize tokens that are considered mid-word. If you want to see how the mid-word scale is affecting output, you can set it to an extreme value and compare.
This is one that seems to work pretty well for creative stuff.

I think this approach can help encourage more varied output and avoid repeating sequences, but there are definitely limitations. One is that once a sequence is identified, you (currently) can only penalize the tail of the sequence. You can't say "Stop repeating stuff, start over and write it in a more creative way" - you only get to say "no" to the last token. I wrote a little more about that over here: #2888 (comment)

Anyway, thanks for testing, and don't worry about annoying me with questions or anything like that. (Although there's no guarantee I'll be able to give you a good/useful answer.)
Force-pushed from 33d6ce0 to 64caa4f
Weird thought - and feel free to let me know if this is too tangential. What about a backtracking approach? That is: once you have detected a repeated sequence, with some probability stop, back up to the start of the sequence, and retry from there. The advantage of this is that it results in an essentially unbiased distribution even for longer sequences. "Penalize tokens that aren't the start of a word" is a good heuristic, but it still fails occasionally. An obvious disadvantage is that it can result in significant backtracking and tail latency, and it either requires speculative results or delaying token output until you know the tokens aren't the prefix of any repeat.
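For what it's worth, a very rough sketch of that backtracking loop might look like the following. Everything here is hypothetical: the sampling callback is a stand-in, a real implementation would also need to roll the KV cache back to the rewind point, and there is no cap on retries.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <random>
#include <set>
#include <vector>

// next_token(context, banned) samples one token given the context, never picking
// a token from `banned`. How it does that (greedy, top-k, ...) is up to the caller.
using Sampler = std::function<int(const std::vector<int> &, const std::set<int> &)>;

std::vector<int> generate_with_backtracking(Sampler next_token,
                                            std::vector<int> context,
                                            std::size_t n_predict,
                                            std::size_t min_length,
                                            double rewind_probability,
                                            std::mt19937 & rng) {
    const std::size_t n_prompt = context.size();
    std::bernoulli_distribution do_rewind(rewind_probability);
    // Tokens that have been banned at a given position after rewinding to it.
    std::map<std::size_t, std::set<int>> banned_at;

    while (context.size() < n_prompt + n_predict) {
        context.push_back(next_token(context, banned_at[context.size()]));

        const std::size_t n = context.size();
        if (n < 2 * min_length) {
            continue;
        }
        // Does the freshly extended tail (length min_length) repeat an earlier,
        // non-overlapping span of the context?
        for (std::size_t start = 0; start + 2 * min_length <= n; ++start) {
            bool repeat = true;
            for (std::size_t i = 0; i < min_length; ++i) {
                if (context[start + i] != context[n - min_length + i]) {
                    repeat = false;
                    break;
                }
            }
            if (repeat && do_rewind(rng)) {
                const std::size_t repeat_start = n - min_length;
                if (repeat_start < n_prompt) {
                    break; // never rewind into the prompt itself
                }
                // Ban the token that started the repeat, then resample from there.
                banned_at[repeat_start].insert(context[repeat_start]);
                context.resize(repeat_start);
                break;
            }
        }
    }
    return context;
}
```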
I'm not sure if you noticed, but I'm actually the one who started that topic! The reason was actually exactly this thought: how to enable a backtracking approach where it cuts off the start of the sequence rather than only being able to say no to the token that would continue it. If I say your idea is genius, that's basically the same as saying mine was too. Right? So let's go with that. :)
Hi, I've been keeping an eye on this PR as I've been having similar thoughts about longer sequence similarity penalties and backtracking. I was contemplating a proof-of-concept backtracking implementation in koboldcpp (which is what I mainly use and is downstream of llama.cpp). In my own experiments doing a lot of guided narrative writing tasks with instruction models, I've noticed that the repetition seems to occur at the paragraph level (i.e. separated by \n\n) with the models I'm mostly using, and I find that the paragraphs before and after within the same generation still generally end up completely different. My thinking is that if you could define a boundary to operate on and backtrack to (in my case, \n\n), this could maybe work very well.
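A small, purely illustrative helper for that boundary idea (it operates on detokenized text for simplicity; a real implementation would have to map the resulting position back to a token index and reset the KV cache accordingly):

```cpp
#include <cstddef>
#include <string>

// Given the position where a repeated span starts, return the position just after
// the nearest preceding paragraph break ("\n\n"), so regeneration restarts at a
// paragraph boundary rather than mid-paragraph.
std::size_t paragraph_rewind_target(const std::string & text, std::size_t repeat_pos) {
    const std::size_t boundary = text.rfind("\n\n", repeat_pos);
    if (boundary == std::string::npos) {
        return 0; // no paragraph break before the repeat; restart from the beginning
    }
    return boundary + 2;
}
```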
@ycros I'll keep your approach in mind, but I'm going to start out using the existing sequence detection code and just changing what happens when there's a sequence. I think I can use that and then just seek back to the longest observed sequence and ban the token it started with. I haven't had a chance to mess with that yet, and I also need to take Mr. GG's advice and reorganize the structure so this isn't in llama.cpp itself.
Force-pushed from 64caa4f to 0ae4c07
How many times will I toggle this pull between draft and ready? Tune in to the next exciting episode to find out! There's still a decent amount of cleanup work to do after restructuring the code. It also doesn't support CMake yet. The main feature I want to add is a way to rewind generation when a repeated sequence is detected.
Very rough, but rewind mode pretty much works now and seems to do a decent job of encouraging diverse output. Still a lot of cleanup to do.
Model tested is … Note: only supports …
* add safetensors to convert.py help message * Check for single-file safetensors model * Update convert.py "model" option help message * revert convert.py help message change
* Add support for stablelm-3b-4e1t * Supports GPU offloading of (n-1) layers
Co-authored-by: Jared Van Bortel <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Bernhard Gstrein <[email protected]>
…#4040) * gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode. * Respect add_bos_token GGUF metadata value * gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
Oh no, this is why I was always scared of merging instead of rebasing!
If you merge and then choose to rebase, you need to rebase onto a commit that includes the commits that were already merged. It's easier if you stick to one or the other for the most part, and not alternate...
Thanks, I think this is what I needed. Actually, it probably wasn't even an issue with merging. I went to rebase to try to get rid of the merges (I had several in a row that I hadn't pushed) and I aborted out of the editor.
Automatically clear completed sequences out of the KV cache
If you quit without saving, it will do the rebase anyway. It will only skip the rebase if your editor returns a nonzero status.
Why was this closed, and what will happen now?
Were you actually using it? As far as I knew, only a few people ever even tried it, and it was also a long, long way from being mergeworthy. It would probably just need to be rewritten entirely, and I'm not sure I have the time and energy to invest in doing that at the moment. More feedback was also necessary to understand what features people were using, what didn't work well, etc. I may maintain the branch in my fork, mostly for my own use.
@KerfuffleV2 If you can, please reopen it. I haven't used it, but the idea behind it is worth trying - I'll try it if it's reopened again!
@KerfuffleV2 There may be a chicken-and-egg problem here. This looks really neat, and I've been following it since you originally posted it. However, without a more convenient UI, I've been putting off testing and hoping it gets simplified. There are too many knobs to tweak. I wonder how hard it would be to add this to example/server? With all the parameters, likely difficult. That said, the concept is solid. I'm optimistic this will still matter in the future, even if you stop maintaining it. @IkariDevGIT you can try it even though it's closed - the branch is still available on their fork.
@crasm I know, but I want him to continue working on it here - that's why I said "I'll try it if it's reopened again!". Don't want this idea to die again.
I'm reopening for visibility - I'd like to see more feedback on this, even if it remains a demo for now.
Language models need this. Too much GPT fluff in the phrasing.
Agreed, please don't let this PR die.
Note: Scroll down to the bottom to read recent progress... I've been very bad about updating this initial post.
edit: Rewriting this description since there have been major changes.
This pull implements a sequence repetition penalty sampler. The aim is to penalize repeating sequences rather than individual tokens. It's sufficiently flexible that it can emulate the existing repetition, frequency and presence penalty samplers.
I also made it possible to add multiple copies of the sampler. So, for example, you could try to match against long sequences with a window of the entire context but apply only a weak penalty, and add another instance that matches shorter sequences and applies a stronger penalty. This approach can also be used with the frequency/presence/repetition modes.
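To make the multiple-instance idea concrete, here is a hypothetical configuration sketch. The struct, its field names and the flat-penalty semantics are inventions for illustration (the PR exposes different and more numerous options); penalize_repeated_sequences() is the helper from the earlier sketch.

```cpp
#include <cstddef>
#include <vector>

struct seqrep_config {
    std::size_t last_n;      // how far back in the context this instance looks
    std::size_t min_length;  // minimum sequence length that counts as a repeat
    float       penalty;     // flat logit penalty applied to the continuation token
};

// Defined in the earlier sketch.
void penalize_repeated_sequences(const std::vector<int> & last_tokens,
                                 std::vector<float>     & logits,
                                 std::size_t              min_length,
                                 float                    penalty);

void apply_seqrep_instances(const std::vector<seqrep_config> & configs,
                            const std::vector<int>           & last_tokens,
                            std::vector<float>               & logits) {
    for (const auto & cfg : configs) {
        // Each instance only considers its own window of recent tokens.
        const std::size_t n     = last_tokens.size();
        const std::size_t first = n > cfg.last_n ? n - cfg.last_n : 0;
        const std::vector<int> window(last_tokens.begin() + (std::ptrdiff_t) first,
                                      last_tokens.end());
        penalize_repeated_sequences(window, logits, cfg.min_length, cfg.penalty);
    }
}
```

Something like { {2048, 24, 0.5f}, {256, 3, 2.0f} } would then express "weakly discourage long repeats anywhere in a large window, strongly discourage short repeats in the recent context".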
It can also attempt to detect mid-word tokens and adjust the penalty for them. Why is this useful? Just for example, with LLaMA, fox is tokenized as fo, x. When the LLM wants to talk about foxes and generates fo, and you then say "No, you can't generate an x now!", weird stuff can happen. From my experimentation, applying a weaker penalty for mid-word tokens (or disabling the penalty for them altogether) produces better results than applying the full penalty.

Right now, configuring this is only implemented for the main example.

It includes a lot of settings, primarily for tuning sequence matching. Without feedback, it's really hard to tell what's going to be most useful, so I basically threw in the kitchen sink.
It doesn't really affect other stuff unless explicitly enabled, so theoretically this could be merged as-is; however, more testing and discussion would really be preferable. I was hoping to get some feedback on my approach and implementation.
I think that in its current state it can actually be useful. From my testing, it's pretty good at stopping the LLM from repeating sections of the prompt or its own output, which seems to happen a lot with longer generations.