Implementation of a sequence repetition penalty sampler #2593
base: master
Conversation
Force-pushed from ab27956 to cb02274
Here is the current command-line help to give an idea of usage. The settings are also described in somewhat more detail elsewhere.
I tested that the emulation mode examples actually produce the same results as the original samplers, with one difference in how seqrep behaves.
Force-pushed from 17eaa96 to 3ca1717
I'll look into this later, but most likely it will be merged after #2398
Force-pushed from 84ee695 to 0875559
Going to set this as a draft for the moment since I still need to double-check that everything still works after rebasing for the GGUF stuff. Feedback is still very welcome.
Force-pushed from a8fd7d6 to 33d6ce0
Current status

Updated for the recent changes. I also added some tests; they cover the repetition and frequency/presence penalty emulation and the basic seqrep functionality, though there are a lot of possible options that aren't covered yet. I'm not sure I'd say it's quite ready to merge, since I'd really like to get some kind of feedback first. I do think it's ready to review and test, and it seems to improve output - at least for some types of content; I wouldn't use any presence/repetition-type penalties for something like code generation. The sampler is pretty large and complex, but it's completely opt-in and the options don't even clog up the command-line help.

Commandline Help Example

Here's a copy of the current command-line help.
The Pitch

Emulating Existing Penalty Samplers

It's possible to specify the settings so that seqrep emulates the existing repetition, frequency and presence penalty samplers.

Seqrep Mode

The full seqrep mode can be used to discourage repeating sequences that already exist in the prompt/generated tokens. Let's say we have some recent tokens and the minimum match length set to 3.
If generating a particular token next would recreate a sequence of at least that length that appeared earlier, we'd want to penalize that token and (hopefully) encourage more varied output. The sampler also has an option that lets sequences which only approximately match an earlier one count as a repeat.
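As a concrete stand-in (an invented example, not the one from the original description): suppose the context so far is 1 2 4 7 8 1 2 and the minimum match length is 3. Generating 4 next would recreate the earlier 1 2 4, so 4 should be penalized. A rough C++ sketch of that exact-match check follows; the function and parameter names are made up for illustration and are not the PR's actual API.

```cpp
#include <cstddef>
#include <vector>

// If appending a token would complete a sequence of at least `min_length` tokens
// that already occurred earlier in `last_tokens`, subtract `penalty` from that
// token's logit. Simple O(n * min_length) scan - fine for a sketch, not for production.
void penalize_repeated_sequences(const std::vector<int> & last_tokens,
                                 std::vector<float>     & logits,
                                 std::size_t              min_length,
                                 float                    penalty) {
    const std::size_t n = last_tokens.size();
    if (min_length < 2 || n < min_length) {
        return;
    }
    // The current tail: the last (min_length - 1) tokens of the context.
    const std::size_t tail = n - (min_length - 1);
    for (std::size_t start = 0; start + min_length <= n; ++start) {
        bool match = true;
        for (std::size_t i = 0; i + 1 < min_length; ++i) {
            if (last_tokens[start + i] != last_tokens[tail + i]) {
                match = false;
                break;
            }
        }
        if (match) {
            // The token that continued this sequence last time; penalize it now.
            const int continuation = last_tokens[start + min_length - 1];
            if (continuation >= 0 && (std::size_t) continuation < logits.size()) {
                logits[continuation] -= penalty;
            }
        }
    }
}
```

With the stand-in context above, the only earlier match is at the start of the buffer, so only the logit for token 4 gets reduced.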
Word Boundary Awareness

The sampler also accepts a floating point mid-word scale value. It has logic to try to identify whether the token that would be penalized sits at a word boundary; when the token appears to be in the middle of a word, the penalty is multiplied by that scale instead of being applied in full.

Aside from the seqrep stuff, I think this is an interesting idea to explore for other types of samplers - temperature, for example. Applying more randomness to the start of words rather than to syllables in the middle of a word could be worth experimenting with. (A rough sketch of the mid-word scaling idea follows.)
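Returning to the mid-word scale: here is a rough illustration of how such a scale could be applied. The boundary heuristic, function name and parameters are assumptions made for this sketch, not the PR's actual word-boundary logic (which handles more cases, e.g. punctuation).

```cpp
#include <string>

// Scale the penalty down when the candidate token would continue a word that is
// already in progress rather than start a new one. With LLaMA-style tokens, a
// new word usually begins with a leading space.
float scaled_penalty(const std::string & prev_piece,      // text of the last generated token
                     const std::string & candidate_piece, // text of the token being penalized
                     float               penalty,
                     float               mid_word_scale)  // e.g. 0.1 = 10% of the penalty mid-word
{
    const bool prev_ends_word   = !prev_piece.empty() &&
        (prev_piece.back() == ' ' || prev_piece.back() == '\n');
    const bool cand_starts_word = !candidate_piece.empty() &&
        (candidate_piece.front() == ' ' || candidate_piece.front() == '\n');

    const bool mid_word = !prev_ends_word && !cand_starts_word;
    return mid_word ? penalty * mid_word_scale : penalty;
}
```

For example: after generating "fo", a candidate "x" neither follows a word-ending token nor starts with a space, so it would only receive the scaled-down penalty.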
Let's try to gather some feedback and see how useful the change is.
An alternative option that I would not have a problem merging is to implement this as a basic example instead of part of llama.cpp. For example, it can be part of common - directly in common/common.cpp, or a new common/experiments.h/.cpp or common/staging.h/.cpp - and enable it in main. If the approach proves useful and finds adoption, we can merge it into llama.cpp and maintain it long-term.
Excuse me if it's inappropriate to chime in. I'm more of an end user, and I noticed that the LLM (in my case nous-hermes-llama2-13b.ggmlv3.q4_K_M) is oddly inclined to repeat, literally word for word, a previously LLM-generated summary included as 'context info' in the prompt, while completely ignoring the main text it was instructed to summarize. My intuition is that the LLM is overly sensitive to its own style of text, in the sense that it triggers a very high signal, mathematically speaking. Even more so because the original text I'm trying to summarize is an audio-to-text transcription of a spoken interview, which is obviously less structured and more chaotic. This PR seems to tackle this issue, right?
Since I'm interested in this kind of summarization chain (where a preceding LLM-generated summary is included in the prompt along with the new text to be summarized), I'm interested in running tests with this proposed solution. While I'm attempting a go at it, what's a good starting point in terms of parameter combinations for this use case?
100% the opposite, this kind of comment is exactly the kind of thing I'm looking for!
Yes, pretty much that's exactly the idea. I haven't tried anything with summaries, but I sometimes like to mess around with having the LLM generate creative stuff. One thing that really annoyed me was how it would just repeat sections from the prompt or what it had previously generated pretty much verbatim.
That's a hard one to answer because even for my own use case I'm not really sure what combination is best. I haven't tried summarizing. I can give you some places to start, but I'm not sure how they'll work for your use case. One thing you could also try (potentially using the examples as a base) is to change the values to an extreme and see how that affects the output. That will help you get a feel for how the settings influence generation.
Pretty simple one to get started. This will only lightly penalize tokens that are considered mid-word. If you want to see how the mid-word scale is affecting output, you can set it to an extreme value and compare.
This is one that seems to work pretty well for creative stuff.

I think this approach can help encourage more varied output and avoid repeating sequences, but there are definitely limitations. One is that once a sequence is identified, you (currently) can only penalize the tail of the sequence. You can't say "Stop repeating stuff, start over and write it in a more creative way" - you only get to say "no" to the last token. I wrote a little more about that over here: #2888 (comment)

Anyway, thanks for testing, and don't worry about annoying me with questions or anything like that. (Although there's no guarantee I'll be able to give you a good/useful answer.)
Force-pushed from 33d6ce0 to 64caa4f
Weird thought - and feel free to let me know if this is too tangential. What about a backtracking approach? That is: once you have detected a repeated sequence, with some probability stop, back up to the start of the sequence, and retry from there. The advantage of this is that it results in an essentially unbiased distribution even for longer sequences. "Penalize tokens that aren't the start of a word" is a good heuristic, but it still fails occasionally. An obvious disadvantage is that it can result in significant backtracking and tail latency, and it either requires speculative results or delaying token output until you know the tokens aren't the prefix of any repeat.
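For what it's worth, a very rough sketch of that backtracking loop might look like the following. Everything here is hypothetical: the sampling callback is a stand-in, a real implementation would also need to roll the KV cache back to the rewind point, and there is no cap on retries.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <random>
#include <set>
#include <vector>

// next_token(context, banned) samples one token given the context, never picking
// a token from `banned`. How it does that (greedy, top-k, ...) is up to the caller.
using Sampler = std::function<int(const std::vector<int> &, const std::set<int> &)>;

std::vector<int> generate_with_backtracking(Sampler next_token,
                                            std::vector<int> context,
                                            std::size_t n_predict,
                                            std::size_t min_length,
                                            double rewind_probability,
                                            std::mt19937 & rng) {
    const std::size_t n_prompt = context.size();
    std::bernoulli_distribution do_rewind(rewind_probability);
    // Tokens that have been banned at a given position after rewinding to it.
    std::map<std::size_t, std::set<int>> banned_at;

    while (context.size() < n_prompt + n_predict) {
        context.push_back(next_token(context, banned_at[context.size()]));

        const std::size_t n = context.size();
        if (n < 2 * min_length) {
            continue;
        }
        // Does the freshly extended tail (length min_length) repeat an earlier,
        // non-overlapping span of the context?
        for (std::size_t start = 0; start + 2 * min_length <= n; ++start) {
            bool repeat = true;
            for (std::size_t i = 0; i < min_length; ++i) {
                if (context[start + i] != context[n - min_length + i]) {
                    repeat = false;
                    break;
                }
            }
            if (repeat && do_rewind(rng)) {
                const std::size_t repeat_start = n - min_length;
                if (repeat_start < n_prompt) {
                    break; // never rewind into the prompt itself
                }
                // Ban the token that started the repeat, then resample from there.
                banned_at[repeat_start].insert(context[repeat_start]);
                context.resize(repeat_start);
                break;
            }
        }
    }
    return context;
}
```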
I'm not sure if you noticed, but I'm actually the one who started that topic! The reason was actually exactly this thought: how to enable a backtracking approach where it cuts off the start of the sequence rather than only being able to say no to the token that would continue it. If I say your idea is genius, that's basically the same as saying mine was too. Right? So let's go with that. :)
Hi, I've been keeping an eye on this PR as I've been having similar thoughts about longer sequence similarity penalties and backtracking. I was contemplating a proof-of-concept backtracking implementation in koboldcpp (which is what I mainly use and is downstream of llama.cpp). In my own experiments doing a lot of guided narrative writing tasks with instruction models, I've noticed that the repetition seems to occur at the paragraph level (i.e. separated by \n\n) with the models I'm mostly using, and I find that the paragraphs before and after within the same generation still generally end up completely different. My thinking is that if you could define a boundary to operate on and backtrack to (in my case, \n\n), this could maybe work very well.
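A small, purely illustrative helper for that boundary idea (it operates on detokenized text for simplicity; a real implementation would have to map the resulting position back to a token index and reset the KV cache accordingly):

```cpp
#include <cstddef>
#include <string>

// Given the position where a repeated span starts, return the position just after
// the nearest preceding paragraph break ("\n\n"), so regeneration restarts at a
// paragraph boundary rather than mid-paragraph.
std::size_t paragraph_rewind_target(const std::string & text, std::size_t repeat_pos) {
    const std::size_t boundary = text.rfind("\n\n", repeat_pos);
    if (boundary == std::string::npos) {
        return 0; // no paragraph break before the repeat; restart from the beginning
    }
    return boundary + 2;
}
```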
@ycros I'll keep your approach in mind, but I'm going to start out using the existing sequence detection code and just changing what happens when there's a sequence. I think I can use that and then just seek back to the longest observed sequence and ban the token it started with. I haven't had a chance to mess with that yet, and I also need to take Mr. GG's advice and reorganize the structure so this isn't in llama.cpp itself.
Force-pushed from 64caa4f to 0ae4c07
How many times will I toggle this pull between draft and ready? Tune in to the next exciting episode to find out! There's still a decent amount of cleanup work to do after restructuring the code. It also doesn't support CMake yet. The main feature I want to add is a way to rewind generation when a repeated sequence is detected.
Very rough, but rewind mode pretty much works now and seems to do a decent job of encouraging diverse output. Still a lot of cleanup to do.
Model tested is … Note: only supports …
* add safetensors to convert.py help message * Check for single-file safetensors model * Update convert.py "model" option help message * revert convert.py help message change
* Add support for stablelm-3b-4e1t * Supports GPU offloading of (n-1) layers
Co-authored-by: Jared Van Bortel <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Bernhard Gstrein <[email protected]>
…#4040) * gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode. * Respect add_bos_token GGUF metadata value * gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
Oh no, this is why I was always scared of merging instead of rebasing!
If you merge and then choose to rebase, you need to rebase onto a commit that includes the commits that were already merged. It's easier if you stick to one or the other for the most part, and not alternate...
Thanks, I think this is what I needed. Actually, it probably wasn't even an issue with merging. I went to rebase to try to get rid of the merges (I had several in a row that I hadn't pushed) and I aborted out of the editor.
Automatically clear completed sequences out of the KV cache
If you quit without saving, it will do the rebase anyway. It will only skip the rebase if your editor returns a nonzero status.
Why was this closed, and what will happen now?
Were you actually using it? As far as I knew, only a few people ever even tried it, and it was also a long, long way from being mergeworthy. It would probably just need to be rewritten entirely, and I'm not sure I have the time and energy to invest in doing that at the moment. More feedback was also necessary to understand what features people were using, what didn't work well, etc. I may maintain the branch in my fork, mostly for my own use.
@KerfuffleV2 If you can, please reopen it. I haven't used it, but the idea behind it is worth trying - I'll try it if it's reopened again!
@KerfuffleV2 There may be a chicken-and-egg problem here. This looks really neat, and I've been following it since you originally posted it. However, without a more convenient UI, I've been putting off testing and hoping it gets simplified. There are too many knobs to tweak. I wonder how hard it would be to add this to example/server? With all the parameters, likely difficult. That said, the concept is solid. I'm optimistic this will still matter in the future, even if you stop maintaining it. @IkariDevGIT you can try it even though it's closed - the branch is still available on their fork.
@crasm I know, but I want him to continue working on it here - that's why I said "I'll try it if it's reopened again!". Don't want this idea to die again.
I'm reopening for visibility - I'd like to see more feedback on this, even if it remains a demo for now.
Language models need this. Too much GPT fluff in the phrasing.
Agreed, please don't let this PR die.
Note: Scroll down to the bottom to read recent progress... I've been very bad about updating this initial post.
edit: Rewriting this description since there have been major changes.
This pull implements a sequence repetition penalty sampler. The aim is to penalize repeating sequences rather than individual tokens. It's sufficiently flexible that it can emulate the existing repetition, frequency and presence penalty samplers.
I also made it possible to add multiple copies of the sampler. So, for example, you could try to match against long sequences with a window of the entire context but apply only a weak penalty, and add another instance that matches shorter sequences and applies a stronger penalty. This approach can also be used with the frequency/presence/repetition modes.
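To make the multiple-instance idea concrete, here is a hypothetical configuration sketch. The struct, its field names and the flat-penalty semantics are inventions for illustration (the PR exposes different and more numerous options); penalize_repeated_sequences() is the helper from the earlier sketch.

```cpp
#include <cstddef>
#include <vector>

struct seqrep_config {
    std::size_t last_n;      // how far back in the context this instance looks
    std::size_t min_length;  // minimum sequence length that counts as a repeat
    float       penalty;     // flat logit penalty applied to the continuation token
};

// Defined in the earlier sketch.
void penalize_repeated_sequences(const std::vector<int> & last_tokens,
                                 std::vector<float>     & logits,
                                 std::size_t              min_length,
                                 float                    penalty);

void apply_seqrep_instances(const std::vector<seqrep_config> & configs,
                            const std::vector<int>           & last_tokens,
                            std::vector<float>               & logits) {
    for (const auto & cfg : configs) {
        // Each instance only considers its own window of recent tokens.
        const std::size_t n     = last_tokens.size();
        const std::size_t first = n > cfg.last_n ? n - cfg.last_n : 0;
        const std::vector<int> window(last_tokens.begin() + (std::ptrdiff_t) first,
                                      last_tokens.end());
        penalize_repeated_sequences(window, logits, cfg.min_length, cfg.penalty);
    }
}
```

Something like { {2048, 24, 0.5f}, {256, 3, 2.0f} } would then express "weakly discourage long repeats anywhere in a large window, strongly discourage short repeats in the recent context".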
It can also attempt to detect mid-word tokens and adjust the penalty for them. Why is this useful? Just for example, with LLaMA, fox is tokenized as fo, x. When the LLM wants to talk about foxes and generates fo, and you then say "No, you can't generate an x now!", weird stuff can happen. From my experimentation, applying a weaker penalty for mid-word tokens (or disabling the penalty for them altogether) produces better results than applying the full penalty.

Right now, configuring this is only implemented for the main example.

It includes a lot of settings, primarily for tuning sequence matching. Without feedback, it's really hard to tell what's going to be most useful, so I basically threw in the kitchen sink.
It doesn't really affect other stuff unless explicitly enabled, so theoretically this could be merged as-is; however, more testing and discussion would really be preferable. I was hoping to get some feedback on my approach and implementation.
I think that in its current state it can actually be useful. From my testing, it's pretty good at stopping the LLM from repeating sections of the prompt or its own output, which seems to happen a lot with longer generations.