llama_cpp: add speculative decoding #6669

Open · wants to merge 1 commit into base: dev

Conversation

kanttouchthis (Contributor)

This PR implements basic support for speculative decoding for the llama_cpp loader. It supports loading a GGUF model as a draft model or using the built-in LlamaPromptLookupDecoding class. Tested with Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf as the main model and qwen2.5-coder-1.5b-instruct-q4_k_m.gguf as the draft model, showing a significant speed-up:
no speculative decoding: 22t/s
prompt lookup decoding: 24t/s
draft model decoding: 30t/s
The PR still needs some testing. I'm not sure whether exposing all of the draft model's parameters in the UI is desirable, or whether copying them from the main model would be more appropriate.
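For context, here is a minimal sketch of how the prompt lookup path looks when calling llama-cpp-python directly, using the library's LlamaPromptLookupDecoding class and draft_model parameter. This is not the loader wiring added by this PR, and the model path and parameters below are just placeholders:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=8192,
    # Drafts up to 10 tokens per step by matching n-grams already present in
    # the prompt, so no second model has to be loaded.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)

out = llm.create_completion("def fibonacci(n):", max_tokens=128)
print(out["choices"][0]["text"])

Using a separate GGUF draft model instead presumably goes through whatever LlamaDraftModel wrapper this PR adds around a second Llama instance; that part is not shown here.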


@kanttouchthis (Contributor, Author)

After more testing, I am occasionally getting this error:

Traceback (most recent call last):
  File "C:\text-generation-webui\modules\callbacks.py", line 59, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\text-generation-webui\modules\llamacpp_model.py", line 262, in generate
    for completion_chunk in completion_chunks:
  File "C:\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 1317, in _create_completion
    for token in self.generate(
  File "C:\text-generation-webui\modules\llama_cpp_python_hijack.py", line 117, in my_generate
    for output in self.original_generate(*args, **kwargs):
  File "C:\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 909, in generate
    self.eval(tokens)
  File "C:\text-generation-webui\modules\llama_cpp_python_hijack.py", line 89, in eval_with_progress
    self.scores[n_past : n_past + n_tokens, :].reshape(-1)[::] = logits
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
ValueError: could not broadcast input array from shape (912384,) into shape (0,)

Setting self.context_params.logits_all=False fixes this, so it's a problem specifically with saving all logits in llama_cpp_python_hijack.py:83:

if self.context_params.logits_all:
    # Pull the logits for every token evaluated in this batch out of the
    # llama.cpp context and copy them into the preallocated self.scores buffer.
    rows = n_tokens
    cols = self._n_vocab
    logits = np.ctypeslib.as_array(
        self._ctx.get_logits(), shape=(rows * cols,)
    )
    self.scores[n_past : n_past + n_tokens, :].reshape(-1)[::] = logits
    self.last_updated_index = n_past + n_tokens - 1

It happens when using prompt lookup decoding or a draft model, and it seems to depend on the input tokens, since it only happens sometimes. The input array in the traceback has 912384 elements, which is 6 × 152064, i.e. six tokens' worth of logits, while the destination slice of self.scores is empty, suggesting n_past is already at or past the end of the buffer when the drafted tokens are evaluated. Not sure how to fix this right now.
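For anyone debugging this, here is a small standalone sketch that reproduces the broadcast failure with shapes inferred from the traceback and shows one possible guard. It is only an illustration of the failure mode, not the fix used in this PR, and the exact shapes are assumptions:

import numpy as np

# Hypothetical shapes inferred from the traceback above: six drafted tokens,
# a 152064-token vocabulary, and an n_past that already points past the end
# of the preallocated scores buffer.
n_vocab = 152064
n_ctx = 4            # deliberately tiny so the overflow is easy to see
n_tokens = 6
n_past = 4

scores = np.zeros((n_ctx, n_vocab), dtype=np.single)
logits = np.zeros(n_tokens * n_vocab, dtype=np.single)

dest = scores[n_past : n_past + n_tokens, :]   # empty slice, shape (0, n_vocab)
try:
    dest.reshape(-1)[::] = logits              # "could not broadcast input array ... into shape (0,)"
except ValueError as e:
    print("reproduced:", e)

# One possible guard (a sketch, not the PR's fix): only write the rows that
# actually fit into the buffer and skip the update when nothing fits.
rows_avail = max(0, min(n_ctx - n_past, n_tokens))
if rows_avail:
    scores[n_past : n_past + rows_avail, :] = logits.reshape(n_tokens, n_vocab)[:rows_avail]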
