llama_cpp: add speculative decoding #6669

Open · wants to merge 1 commit into base: dev

Conversation

kanttouchthis (Contributor)

This PR implements basic support for speculative decoding for the llama_cpp loader. It supports loading a GGUF model as a draft model or using the built-in LlamaPromptLookupDecoding class. Tested with Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf as the main model and qwen2.5-coder-1.5b-instruct-q4_k_m.gguf as the draft model, showing a significant speed-up:
no speculative decoding: 22t/s
prompt lookup decoding: 24t/s
draft model decoding: 30t/s
The PR still needs some testing. I'm not sure whether exposing all of the draft model's parameters in the UI is desirable, or whether copying them from the main model would be more appropriate.
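For context, here is a minimal sketch of how the prompt lookup path looks when calling llama-cpp-python directly, using the library's LlamaPromptLookupDecoding class and draft_model parameter. This is not the loader wiring added by this PR, and the model path and parameters below are just placeholders:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=8192,
    # Drafts up to 10 tokens per step by matching n-grams already present in
    # the prompt, so no second model has to be loaded.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)

out = llm.create_completion("def fibonacci(n):", max_tokens=128)
print(out["choices"][0]["text"])

Using a separate GGUF draft model instead presumably goes through whatever LlamaDraftModel wrapper this PR adds around a second Llama instance; that part is not shown here.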


@kanttouchthis (Contributor, Author)

After more testing, I am occasionally getting this error:

Traceback (most recent call last):
  File "C:\text-generation-webui\modules\callbacks.py", line 59, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\text-generation-webui\modules\llamacpp_model.py", line 262, in generate
    for completion_chunk in completion_chunks:
  File "C:\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 1317, in _create_completion
    for token in self.generate(
  File "C:\text-generation-webui\modules\llama_cpp_python_hijack.py", line 117, in my_generate
    for output in self.original_generate(*args, **kwargs):
  File "C:\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 909, in generate
    self.eval(tokens)
  File "C:\text-generation-webui\modules\llama_cpp_python_hijack.py", line 89, in eval_with_progress
    self.scores[n_past : n_past + n_tokens, :].reshape(-1)[::] = logits
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
ValueError: could not broadcast input array from shape (912384,) into shape (0,)

Setting self.context_params.logits_all=False fixes this, so it's a problem specifically with saving all logits in llama_cpp_python_hijack.py:83:

if self.context_params.logits_all:
    # Pull the logits for every token evaluated in this batch out of the
    # llama.cpp context and copy them into the preallocated self.scores buffer.
    rows = n_tokens
    cols = self._n_vocab
    logits = np.ctypeslib.as_array(
        self._ctx.get_logits(), shape=(rows * cols,)
    )
    self.scores[n_past : n_past + n_tokens, :].reshape(-1)[::] = logits
    self.last_updated_index = n_past + n_tokens - 1

It happens when using prompt lookup decoding or a draft model, and it seems to depend on the input tokens, since it only happens sometimes. The input array in the traceback has 912384 elements, which is 6 × 152064, i.e. six tokens' worth of logits, while the destination slice of self.scores is empty, suggesting n_past is already at or past the end of the buffer when the drafted tokens are evaluated. Not sure how to fix this right now.
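For anyone debugging this, here is a small standalone sketch that reproduces the broadcast failure with shapes inferred from the traceback and shows one possible guard. It is only an illustration of the failure mode, not the fix used in this PR, and the exact shapes are assumptions:

import numpy as np

# Hypothetical shapes inferred from the traceback above: six drafted tokens,
# a 152064-token vocabulary, and an n_past that already points past the end
# of the preallocated scores buffer.
n_vocab = 152064
n_ctx = 4            # deliberately tiny so the overflow is easy to see
n_tokens = 6
n_past = 4

scores = np.zeros((n_ctx, n_vocab), dtype=np.single)
logits = np.zeros(n_tokens * n_vocab, dtype=np.single)

dest = scores[n_past : n_past + n_tokens, :]   # empty slice, shape (0, n_vocab)
try:
    dest.reshape(-1)[::] = logits              # "could not broadcast input array ... into shape (0,)"
except ValueError as e:
    print("reproduced:", e)

# One possible guard (a sketch, not the PR's fix): only write the rows that
# actually fit into the buffer and skip the update when nothing fits.
rows_avail = max(0, min(n_ctx - n_past, n_tokens))
if rows_avail:
    scores[n_past : n_past + rows_avail, :] = logits.reshape(n_tokens, n_vocab)[:rows_avail]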
