[Usage]: how to use EAGLE on vLLM? #11126

Open
xiongqisong opened this issue Dec 12, 2024 · 27 comments
Labels
usage How to use vllm

Comments

@xiongqisong

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I want to test EAGLE on vLLM, but I have tried many ways to run it and failed every time.
The target model is Llama-2-chat-hf, and the draft model is EAGLE-llama2-chat from the original EAGLE author's GitHub.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@xiongqisong xiongqisong added the usage How to use vllm label Dec 12, 2024
@llsj14
Contributor

llsj14 commented Dec 16, 2024

I encountered the same problem.
I tried the following commands, but they resulted in errors related to loading EAGLE models.
After modifying the pt_weights_iterator function, I managed to load the models. However, the acceptance rate is very low (~0.068).

@sroy745 @LiuXiaoxuanPKU
Could you provide some guidance on running EAGLE models on vLLM? Should I convert the original EAGLE model? Any tips would be greatly appreciated!

Command

python -m vllm.entrypoints.openai.api_server \
     --model meta-llama/Llama-2-7b-chat-hf \
     --enable-prefix-caching \
     --port 8088 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model yuhuili/EAGLE-llama2-chat-7B \
     --num-speculative-tokens 2

python -m vllm.entrypoints.openai.api_server \
     --model meta-llama/Llama-3.1-70B-Instruct \
     --enable-prefix-caching \
     --port 8088 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model yuhuili/EAGLE-LLaMA3-Instruct-70B \
     --num-speculative-tokens 2 \
     -tp 4

Error message

    self.proposer_worker.load_model()
  File "/vllm/vllm/worker/worker.py", line 155, in load_model
    self.model_runner.load_model()
  File "/vllm/vllm/worker/model_runner.py", line 1093, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 370, in load_model
    loaded_weights = model.load_weights(
  File "/vllm/vllm/model_executor/models/llama.py", line 594, in load_weights
    return loader.load_weights(
  File "/vllm/vllm/model_executor/models/utils.py", line 242, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
  File "/vllm/vllm/model_executor/models/utils.py", line 229, in _load_module
    raise ValueError(msg)
ValueError: There is no module or parameter named 'embed_tokens' in LlamaForCausalLM

What I tried

--- a/vllm/model_executor/model_loader/weight_utils.py
+++ b/vllm/model_executor/model_loader/weight_utils.py
@@ -414,6 +414,7 @@ def pt_weights_iterator(
     hf_weights_files: List[str]
 ) -> Generator[Tuple[str, torch.Tensor], None, None]:
     """Iterate over the weights in the model bin/pt files."""
+    #print("hf_weights_files: ", hf_weights_files)
     enable_tqdm = not torch.distributed.is_initialized(
     ) or torch.distributed.get_rank() == 0
     for bin_file in tqdm(
@@ -423,6 +424,21 @@ def pt_weights_iterator(
             bar_format=_BAR_FORMAT,
     ):
         state = torch.load(bin_file, map_location="cpu")
+
+        def transform_key(key: str) -> str:
+            """Transforms keys based on predefined rules."""
+            if key.startswith("embed_tokens"):
+                return "model." + key  # Add 'model.' prefix for 'embed_tokens'
+            if key.startswith("fc"):
+                return None  # Llama does not support fc layers
+            return key
+
+        state = {
+            new_key: weight_value
+            for weight_name, weight_value in state.items()
+            if (new_key := transform_key(weight_name)) is not None
+        }
+
+        str_values = list(state.keys())
+        print("Extracted str values:", str_values)
         yield from state.items()
         del state
         torch.cuda.empty_cache()

@xiongqisong
Author

I use VS Code to debug EAGLE on vLLM; here is my debug configuration:

{
    "name": "debug eagle latency",
    "type": "debugpy",
    "request": "launch",
    "program": "benchmarks/benchmark_latency.py",
    "console": "integratedTerminal",
    "args": [
        "--model",
        "/data/xqs/model/Llama-2-7b-chat-hf",
        "--speculative-model",
        "/data1/xqs/EAGLE-llama",
        "--num_speculative_tokens",
        "4",
        "--use-v2-block-manager",
        "--batch-size",
        "1",
        "--input-len",
        "1024",
        "--output-len",
        "128",
        "--max-model-len",
        "2048"
    ],
    "justMyCode": false
}

The config.json file of the draft model (/data1/xqs/EAGLE-llama) is:

{
	"model_type": "eagle",
	"model": {
		"architectures": ["LlamaForCausalLM"],
		"bos_token_id": 1,
		"eos_token_id": 2,
		"hidden_act": "silu",
		"hidden_size": 4096,
		"initializer_range": 0.02,
		"intermediate_size": 11008,
		"max_position_embeddings": 4096,
		"model_type": "eagle",
		"num_attention_heads": 32,
		"num_hidden_layers": 1,
		"num_key_value_heads": 32,
		"pretraining_tp": 1,
		"rms_norm_eps": 1e-06,
		"rope_scaling": null,
		"tie_word_embeddings": false,
		"torch_dtype": "float16",
		"transformers_version": "4.32.0.dev0",
		"use_cache": true,
		"vocab_size": 32000
	}
}

The files contained in the draft model folder are:

(screenshot of the folder contents omitted)

The conversion script that changes the EAGLE draft model weights into vLLM-format weights is:

import json

import torch
from safetensors.torch import load_file, save_file

ckpt = torch.load("/data/model/EAGLE-llama2-chat-7B/pytorch_model.bin")
ref_ckpt = load_file("/data/model/Llama-2-7b-chat-hf/model-00002-of-00002.safetensors")

ckpt['lm_head.weight'] = ref_ckpt['lm_head.weight']

save_file(ckpt, "/data1/xqs/EAGLE-llama/model.safetensors")

with open("/data/model/EAGLE-llama2-chat-7B/config.json") as rf:
    cfg = json.load(rf)

cfg = {"model_type": "eagle", "model": cfg}

with open("/data1/xqs/EAGLE-llama/config.json", "w") as wf:
    json.dump(cfg, wf)

# delete EAGLE-LLaMA3-Instruct-8B/pytorch_model.bin

The conversion script is adapted from https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d; the author added EAGLE support to vLLM, but didn't show a full end-to-end usage example, which is confusing.

The error I faced is:

[rank0]:   File "/data1/xqs/vllm/vllm/executor/gpu_executor.py", line 39, in _init_executor
[rank0]:     self.driver_worker.init_device()
[rank0]:   File "/data1/xqs/vllm/vllm/spec_decode/spec_decode_worker.py", line 307, in init_device
[rank0]:     self.proposer_worker.load_model()
[rank0]:   File "/data1/xqs/vllm/vllm/worker/worker.py", line 152, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/data1/xqs/vllm/vllm/worker/model_runner.py", line 1066, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/model_loader/loader.py", line 305, in load_model
[rank0]:     model.load_weights(self._get_all_weights(model_config, model))
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/models/llama.py", line 581, in load_weights
[rank0]:     loader.load_weights(
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/models/utils.py", line 230, in load_weights
[rank0]:     autoloaded_weights = list(self._load_module("", self.module, weights))
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/models/utils.py", line 219, in _load_module
[rank0]:     raise ValueError(msg)
[rank0]: ValueError: There is no module or parameter named 'embed_tokens' in LlamaForCausalLM

It seems like vLLM doesn't run the load_weights method of the EAGLE class in model_executor/models/eagle.py? That's weird.

@llsj14
Contributor

llsj14 commented Dec 16, 2024

In my case, after converting the EAGLE checkpoints using the script, I encountered the following two errors.
I only succeeded in running the Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B combination.
vLLM last commit: 69ba344

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B

  • TP: 1
ERROR 12-16 05:03:28 engine.py:366]   File "/mnt/lvm/sungjae/pr/vllm/vllm/model_executor/model_loader/loader.py", line 364, in load_model
ERROR 12-16 05:03:28 engine.py:366]     loaded_weights = model.load_weights(
ERROR 12-16 05:03:28 engine.py:366]   File "/mnt/lvm/sungjae/pr/vllm/vllm/model_executor/models/eagle.py", line 161, in load_weights
ERROR 12-16 05:03:28 engine.py:366]     raise ValueError("Found bias in the loaded weights "
ERROR 12-16 05:03:28 engine.py:366] ValueError: Found bias in the loaded weights but the model config doesn't have bias

Llama-3.1-70B-Instruct / EAGLE-LLaMA3-Instruct-70B

  • TP: 4
    self.worker = worker_class(*args, **kwargs)
  File "/mnt/lvm/sungjae/pr/vllm/vllm/spec_decode/spec_decode_worker.py", line 92, in create_spec_worker
    spec_decode_worker = SpecDecodeWorker.create_worker(
  File "/mnt/lvm/sungjae/pr/vllm/vllm/spec_decode/spec_decode_worker.py", line 182, in create_worker
    raise NotImplementedError(
NotImplementedError: EAGLE does not support TP > 1 yet

Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B

  • TP: 1
  • Number of speculative tokens: 2
  • No errors (Run successful)
  • However, the acceptance rate is very low.
  • I benchmarked using LLMPerf by sending random inputs from sonnet.txt.
INFO 12-16 05:51:52 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.086, System efficiency: 0.354, Number of speculative tokens: 2, Number of accepted tokens: 43521, Number of draft tokens: 508066, Number of emitted tokens: 269680.
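
A quick sanity check on how the numbers in this log line appear to relate to one another. The relationship below is inferred from the logged values, not taken from the vLLM source, and check_spec_metrics is only an illustrative helper:

def check_spec_metrics(num_accepted: int, num_draft: int, num_emitted: int, k: int) -> None:
    # Draft acceptance rate: fraction of drafted tokens the target model accepted.
    acceptance_rate = num_accepted / num_draft
    # Each scoring step drafts k tokens and can emit at most k + 1 tokens
    # (the k accepted drafts plus one token from the target model).
    max_emitted = (num_draft // k) * (k + 1)
    system_efficiency = num_emitted / max_emitted
    print(f"acceptance rate ~ {acceptance_rate:.3f}, system efficiency ~ {system_efficiency:.3f}")

# Numbers from the log line above (k = 2 speculative tokens):
check_spec_metrics(num_accepted=43521, num_draft=508066, num_emitted=269680, k=2)
# -> acceptance rate ~ 0.086, system efficiency ~ 0.354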

@llsj14
Contributor

llsj14 commented Dec 16, 2024

@abhigoyal1997
I am grateful for your Eagle checkpoint converter, which allowed me to successfully run the Llama 3 combinations as described above. Do you have any updates on your converters, and do you think the acceptance rate of the Llama 3 combinations is reasonable?

@xiongqisong
Author

(quoting llsj14's comment above)

I think you can run the target model and draft model with the original EAGLE GitHub code; if the speedup is normal (2~4x, as in the paper), the problem shouldn't be EAGLE but vLLM.
I have tried many target/draft model pairs with the original EAGLE code and they all work fine; it's just very hard to use EAGLE on vLLM.

@xiongqisong
Author

xiongqisong commented Dec 16, 2024

I think the code in vllm/transformers_utils/configs/eagle.py line 14 is wrong:
(screenshot omitted)
because HF transformers doesn't support a model type 'eagle', so we can't use AutoConfig to configure the EAGLE draft model.
My transformers version is 4.46.3, and I can use vLLM to run the target model (i.e. the original model), Llama-2-7b-chat-hf, on its own.
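
A minimal sketch of the failure mode described above, assuming the draft model's config.json keeps model_type "eagle" inside the nested model dict (the dict values below are only illustrative):

from transformers import AutoConfig

nested = {"model_type": "eagle", "hidden_size": 4096, "num_hidden_layers": 1, "vocab_size": 32000}

try:
    # Roughly what EAGLEConfig.__init__ does with the nested "model" dict;
    # "eagle" is not a registered transformers model type, so this raises.
    AutoConfig.for_model(**nested)
except ValueError as e:
    print("AutoConfig rejects 'eagle':", e)

# Workaround discussed later in this thread: treat the draft model as a llama config.
nested["model_type"] = "llama"
llama_cfg = AutoConfig.for_model(**nested)
print(type(llama_cfg).__name__)  # LlamaConfig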

@bettybaii

(quoting llsj14's earlier comment: the two load errors and the Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B results)

After converting the Eagle checkpoints using the script, I encountered the same error as you did. I only succeeded in running the Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B combination.

However, when using Llama-2-7b-chat-hf, I encountered the following error:
[rank0]: File "/mnt/baizhuoyan/vllm/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/mnt/baizhuoyan/vllm/vllm/model_executor/model_loader/loader.py", line 402, in load_model
[rank0]: model.load_weights(self._get_all_weights(model_config, model))
[rank0]: File "/mnt/baizhuoyan/vllm/vllm/model_executor/models/eagle.py", line 149, in load_weights
[rank0]: raise ValueError("Found bias in the loaded weights "
[rank0]: ValueError: Found bias in the loaded weights but the model config doesn't have bias

@xiongqisong
Author

(quoting llsj14's and bettybaii's comments above)

I think the EAGLE implementation in vLLM has some limitation that prevents running arbitrary models with EAGLE on vLLM; we need to find it by reading more of the vLLM code. Fighting!

@bettybaii

bettybaii commented Dec 16, 2024

(quoting the same exchange as above)

After adding "bias": true to the config.json file of EAGLE-llama2-chat-7B, Llama-2-7b-chat-hf was able to run normally. However, the acceptance rate still remains abnormally low.

@xiongqisong
Author

(quoting the exchange above, including bettybaii's note that adding "bias": true made Llama-2-7b-chat-hf run, though with a low acceptance rate)

Can you show your draft model's config.json after converting?

@bettybaii

(quoting the exchange above, ending with xiongqisong's request for the converted draft model's config.json)

config.json:
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "bias": true}}

@xiongqisong
Author

xiongqisong commented Dec 16, 2024

@bettybaii my config.json is the same as yours:

{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "bias": true}}

but I still can't even run vLLM, what a pity...

@xiongqisong
Author

When I run the same command, I get this error:

    speculative_config = SpeculativeConfig.maybe_create_spec_config(
  File "/data1/xqs/vllm/vllm/config.py", line 1363, in maybe_create_spec_config
    draft_model_config = ModelConfig(
  File "/data1/xqs/vllm/vllm/config.py", line 210, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/data1/xqs/vllm/vllm/transformers_utils/config.py", line 192, in get_config
    config = config_class.from_pretrained(
  File "/data1/xqs/vllm/vllm/transformers_utils/configs/eagle.py", line 55, in from_pretrained
    return cls.from_dict(config_dict, **kwargs)
  File "/data1/xqs/miniconda3/envs/vllm-eagle/lib/python3.10/site-packages/transformers/configuration_utils.py", line 718, in from_dict
    config = cls(**config_dict)
  File "/data1/xqs/vllm/vllm/transformers_utils/configs/eagle.py", line 16, in __init__
    model_config = None if model is None else (AutoConfig.for_model(
  File "/data1/xqs/miniconda3/envs/vllm-eagle/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 912, in for_model
    raise ValueError(
ValueError: Unrecognized model identifier: eagle. Should contain one of albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth

@bettybaii

bettybaii commented Dec 16, 2024

(quoting xiongqisong's config.json and the "Unrecognized model identifier: eagle" error above)

That’s strange… I haven’t encountered this error before, and I haven’t made any modifications to the Eagle-related code. Could you try pinpointing the cause in more detail?
Here is the command I used:

python ./benchmarks/benchmark_throughput.py  \
--backend vllm \
--tensor-parallel-size 2 \
--dataset /mnt/sharegpt_gpt4.jsonl \
--model /mnt/data/models/huggingface-models/Llama-2-7b-chat-hf/ \
--speculative-model /mnt/data/models/huggingface-models/speculative-models/EAGLE-llama2-chat-7B/ \
--speculative-draft-tensor-parallel-size 1 \
--num_speculative_tokens 3 \
--enforce-eager \
--num-prompts 40

@xiongqisong
Author

(quoting the error above and bettybaii's benchmark_throughput command)

The point is here: vLLM uses HF's AutoConfig to initialize EAGLEConfig, but the model_type is 'eagle', and HF's transformers library doesn't support an 'eagle' model:
(screenshot omitted)

@bettybaii

bettybaii commented Dec 16, 2024

(quoting xiongqisong's point above about AutoConfig and the 'eagle' model_type)

I haven’t encountered this issue; could it be related to the environment? My transformers version is 4.45.2.
After successfully running the code, I used print("EAGLEConfig", model_config) to output the result of model_config, which was as follows:

EAGLEConfig LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bias": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 1,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.45.2",
  "use_cache": true,
  "vocab_size": 32000
}

@xiongqisong
Author

xiongqisong commented Dec 17, 2024

@bettybaii
After pulling the latest vLLM code, downgrading the transformers version (to the same one as yours), and changing the code:

--- a/vllm/transformers_utils/configs/eagle.py
+++ b/vllm/transformers_utils/configs/eagle.py
@@ -11,7 +11,8 @@ class EAGLEConfig(PretrainedConfig):
                  model: Union[PretrainedConfig, dict, None] = None,
                  truncated_vocab_size: Optional[int] = None,
                  **kwargs):
-
+        if model is not None:
+            model['model_type']='llama'
         model_config = None if model is None else (AutoConfig.for_model(
             **model) if isinstance(model, dict) else model)

and disabling torch dynamo compilation, I can run EAGLE on vLLM with Llama-2:

import torch._dynamo
torch._dynamo.config.suppress_errors = True

If I don't disable torch dynamo compilation, the propose stage crashes; I haven't figured out the reason yet.
I also changed the draft model's config.json because of the bias problem; the newer vLLM version renamed the config key:

{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "eagle", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "eagle_fc_bias": true}}

I added 'eagle_fc_bias' at the end.
I find the draft acceptance rate is 0.418, which is low. I think the reason is mentioned at #9565 (comment): EAGLE doesn't just reuse one decoder layer like the target model, it has different pre-processing / mid-layer processing than the original decoder layer. I think the fix hasn't been merged into the main branch yet, so the current code can't run EAGLE correctly without changes.
The special handling in EAGLE is (see the sketch after this list):

  • after appending the token generated by the target model, cut the first token of the sequence off, embed the new sequence, and combine the embedding with the hidden state (computed by the target model) as input to EAGLE
  • don't apply the input layernorm in the decoder layer
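
A rough sketch (not vLLM's actual code) of the input preparation described in the two bullets above; build_eagle_inputs is an illustrative helper, and the sizes are borrowed from the Llama-2-7B configs shown earlier in this thread:

import torch
import torch.nn as nn

hidden_size, vocab_size = 4096, 32000
embed_tokens = nn.Embedding(vocab_size, hidden_size)      # shared with the target model
fc = nn.Linear(2 * hidden_size, hidden_size, bias=True)   # EAGLE's fusion layer

def build_eagle_inputs(input_ids: torch.Tensor,
                       target_hidden_states: torch.Tensor,
                       new_token: torch.Tensor) -> torch.Tensor:
    """input_ids: [seq], target_hidden_states: [seq, hidden], new_token: [1]."""
    # 1. Append the token the target model just generated, then drop the first token
    #    of the sequence so positions line up with the target hidden states.
    shifted_ids = torch.cat([input_ids, new_token])[1:]                # still [seq]
    # 2. Embed the shifted sequence and fuse it with the target model's hidden states.
    embeds = embed_tokens(shifted_ids)                                 # [seq, hidden]
    fused = fc(torch.cat([embeds, target_hidden_states], dim=-1))      # [seq, hidden]
    # 3. `fused` goes straight into EAGLE's single decoder layer,
    #    skipping the usual input RMSNorm.
    return fused

# Toy usage:
out = build_eagle_inputs(torch.randint(0, vocab_size, (8,)),
                         torch.randn(8, hidden_size),
                         torch.randint(0, vocab_size, (1,)))
print(out.shape)  # torch.Size([8, 4096])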

@bettybaii

@xiongqisong
Since I’m not very familiar with Eagle, I’m wondering if the special handling of EAGLE you mentioned is easy to modify within vllm? In addition, I noticed that after conversion, the draft model’s parameter size increased significantly (from 1.55GB to 3.4048GB), which resulted in a substantial increase in GPU memory consumption and considerably extended the computation time (with the average_time_per_proposal_tok_ms reaching nearly 4 ms). Are you interested in addressing the issue of efficiently using Eagle in vllm? We could potentially work together on this.

@xiongqisong
Author

(quoting bettybaii's comment above)

I'm familiar with EAGLE: I have implemented chatglm/bluelm on the original EAGLE GitHub and discussed some problems with EAGLE's author (liyuhui). I'm glad to help address the issue; although I'm not very familiar with the vLLM code, I can learn it and modify it through this work.

Env and observed phenomena

latest vLLM code
machine: V100, 1 GPU, 32GB memory
target model: Llama-2-7b-chat-hf
draft model: https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B, which is from yuhuili's GitHub
draft model parameters: 0.24B (theoretically consumes 0.96GB of GPU memory in fp32, 0.48GB in bf16)
loaded draft model weights: 0.9278GB (I think this is normal)
loaded target model weights: 12.53GB (I think this is normal, mixed precision)
average_time_per_proposal_tok_ms: 3.11ms (this is weird)
scoring_time_ms: 26.33ms (this is weird too, but when I run the target model alone the time is the same, so maybe vLLM's statistics logic is wrong)
verification_time_ms: 1.4ms
speculative tokens: 4
draft acceptance rate: 0.421 (meaning 42.1%?)

clues

(screenshots omitted)

@xiongqisong
Author

@bettybaii @llsj14
So far I have found some problems in the EAGLE implementation in vLLM:

EAGLE model's architecture is wrong, with 5 unnecessary modules

The right architecture from the original EAGLE GitHub is:
(screenshot omitted)
The wrong architecture in vLLM is:

EAGLE(
(model): LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): VocabParallelEmbedding(num_embeddings=32000, embedding_dim=4096, org_vocab_size=32000, num_embeddings_padded=32000, tp_size=1)
(layers): ModuleList(
(0): LlamaDecoderLayer(
(self_attn): LlamaAttention(
(qkv_proj): QKVParallelLinear(in_features=4096, output_features=12288, bias=False, tp_size=1, gather_output=False)
(o_proj): RowParallelLinear(input_features=4096, output_features=4096, bias=False, tp_size=1, reduce_results=True)
(rotary_emb): RotaryEmbedding(head_size=128, rotary_dim=128, max_position_embeddings=4096, base=10000.0, is_neox_style=True)
(attn): Attention(head_size=128, num_heads=32, num_kv_heads=32, scale=0.08838834764831845, backend=XFormersImpl)
)
(mlp): LlamaMLP(
(gate_up_proj): MergedColumnParallelLinear(in_features=4096, output_features=22016, bias=False, tp_size=1, gather_output=False)
(down_proj): RowParallelLinear(input_features=11008, output_features=4096, bias=False, tp_size=1, reduce_results=True)
(act_fn): SiluAndMul()
)
(input_layernorm): RMSNorm(hidden_size=4096, eps=1e-06)
(post_attention_layernorm): RMSNorm(hidden_size=4096, eps=1e-06)
)
)
(norm): RMSNorm(hidden_size=4096, eps=1e-06)
)
(lm_head): ParallelLMHead(num_embeddings=32000, embedding_dim=4096, org_vocab_size=32000, num_embeddings_padded=32000, tp_size=1)
(logits_processor): LogitsProcessor(vocab_size=32000, forg_vocab_size=32000, scale=1.0, logits_as_input=False)
(sampler): Sampler()
)
(fc): Linear(in_features=8192, out_features=4096, bias=True)
(lm_head): ParallelLMHead(num_embeddings=32000, embedding_dim=4096, org_vocab_size=32000, num_embeddings_padded=32000, tp_size=1)
(logits_processor): LogitsProcessor(vocab_size=32000, forg_vocab_size=32000, scale=1.0, logits_as_input=False)

PS: I can't upload an image that big, so I marked the unnecessary modules with strikethrough ("delete line") instead.

code errors

  • It doesn't use the token generated by the target model as input to the EAGLE draft model. The right way is to concatenate seq + the first generated token, then remove the first token (not the first generated token, but the first token of the seq), embed the new sequence, and concatenate the embedding with the hidden state (computed by the target model).
  • The input layernorm is not needed; EAGLE only reuses one decoder layer of the target model, not the whole model.
  • It doesn't forward EAGLE x times to generate a proposal tree, where x is the tree depth chosen by the user (see the loop sketch after this list).
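
A simplified, chain-shaped sketch of the multi-step drafting loop the last bullet refers to (real EAGLE expands a tree of proposals); draft_chain, eagle_layer, and lm_head are illustrative placeholders, not vLLM APIs:

import torch

def draft_chain(ids, hidden, next_token, depth, embed_tokens, fc, eagle_layer, lm_head):
    """ids: [N] tokens, hidden: [N, H] target hidden states, next_token: [1]."""
    proposals = []
    for _ in range(depth):
        # Same shift + fuse step as in the earlier sketch.
        shifted = torch.cat([ids[1:], next_token])
        fused = fc(torch.cat([embed_tokens(shifted), hidden], dim=-1))
        out = eagle_layer(fused)                                  # one decoder layer, no input norm
        draft = lm_head(out[-1]).argmax(dim=-1, keepdim=True)     # greedy draft token, shape [1]
        proposals.append(draft)
        ids = torch.cat([ids, next_token])                        # grow the context
        hidden = torch.cat([hidden, out[-1:]])                    # reuse EAGLE's own hidden state
        next_token = draft
    return torch.cat(proposals)

# Toy usage with stand-in modules (identity decoder layer, just for shape checking):
H, V = 4096, 32000
toks = draft_chain(torch.randint(0, V, (8,)), torch.randn(8, H), torch.randint(0, V, (1,)),
                   depth=4,
                   embed_tokens=torch.nn.Embedding(V, H),
                   fc=torch.nn.Linear(2 * H, H),
                   eagle_layer=torch.nn.Identity(),
                   lm_head=torch.nn.Linear(H, V))
print(toks.shape)  # torch.Size([4])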

Other errors

  • Even when no proposal tokens are accepted by the target model, the acceptance rate isn't 0, because vLLM counts the token generated by the target model in both the numerator and the denominator, which confuses users (see the example below).
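
For example, if the rate really is computed this way (this is just my reading of the behavior, not verified against vLLM's metrics code), a step that accepts none of its proposals would still report a non-zero rate:

# hypothetical illustration of the accounting described above, not vLLM code
k_draft, accepted, bonus = 4, 0, 1   # 4 proposals, none accepted, 1 token emitted by the target model
rate = (accepted + bonus) / (k_draft + bonus)
print(rate)                          # 0.2 instead of the expected 0.0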

@llsj14
Contributor

llsj14 commented Dec 19, 2024

@xiongqisong
I believe this is a great finding. Perhaps we can discuss it further in a new issue. I would also like to delve deeper into the implementation details of EAGLE on vLLM. Could you create another issue so we can discuss this further?

@llsj14
Contributor

llsj14 commented Dec 19, 2024

@xiongqisong @bettybaii
I also tested after reading your comments. It seems that the issue with running the EAGLE model was solely due to the missing "eagle_fc_bias" key, rather than any other configuration settings. (I suggested changes to the script.)

For the EAGLE-LLaMA3-Instruct-70B model, I had to add the option --speculative-draft-tensor-parallel-size 1 because vLLM does not yet support TP > 1 for the EAGLE model.

I have also added experimental results and settings based on your help.

@sroy745
I hope this summary helps with your documentation on EAGLE.

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B

  • eagle config.json
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "eagle_fc_bias": true}}
  • vLLM running script
#!/bin/bash
python -m vllm.entrypoints.openai.api_server \
     --model /.../Llama-2-7b-chat-hf \
     --port 8089 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model /.../EAGLE-llama2-chat-7B \
     --num-speculative-tokens 1
  • accept rate on C4 dataset (input length < 1024, output length = 128)
INFO 12-19 05:27:19 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.246, System efficiency: 0.623, Number of speculative tokens: 1, Number of accepted tokens: 12915, Number of draft tokens: 52434, Number of emitted tokens: 65349.
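
A quick way to sanity-check such a setup before looking at the speculative metrics is a plain request against vLLM's OpenAI-compatible completions endpoint on the same port (the prompt and sampling parameters below are just an example; the model value must match what was passed to --model):

import requests

resp = requests.post(
    "http://localhost:8089/v1/completions",
    json={
        "model": "/.../Llama-2-7b-chat-hf",  # same placeholder path as in the script above
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])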

Llama-3.1-70B-Instruct / EAGLE-LLaMA3-Instruct-70B

  • eagle config.json
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 128000, "eos_token_id": 128001, "hidden_act": "silu", "hidden_size": 8192, "initializer_range": 0.02, "intermediate_size": 28672, "max_position_embeddings": 2048, "model_type": "llama", "num_attention_heads": 64, "num_key_value_heads": 8, "num_hidden_layers": 1, "pad_token_id": 0, "rms_norm_eps": 1e-05, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.28.1", "use_cache": true, "vocab_size": 128256, "rope_theta": 500000.0, "bias": false}}
  • vLLM running script
#!/bin/bash
python -m vllm.entrypoints.openai.api_server \
     --model /.../Llama-3.1-70B-Instruct \
     --port 8089 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model /.../EAGLE-LLaMA3-Instruct-70B \
     --num-speculative-tokens 1 \
     -tp 4 \
     --speculative-draft-tensor-parallel-size 1
  • accept rate on C4 dataset (input length < 1024, output length = 128)
INFO 12-19 05:20:29 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.242, System efficiency: 0.621, Number of speculative tokens: 1, Number of accepted tokens: 12932, Number of draft tokens: 53445, Number of emitted tokens: 66377.

Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B

  • eagle config.json
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 128000, "eos_token_id": 128001, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 2048, "model_type": "llama", "num_attention_heads": 32, "num_key_value_heads": 8, "num_hidden_layers": 1, "pad_token_id": 0, "rms_norm_eps": 1e-05, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.28.1", "use_cache": true, "vocab_size": 128256, "rope_theta": 500000.0, "bias": false}}
  • vLLM running script
python -m vllm.entrypoints.openai.api_server \
     --model /mnt/lvm/checkpoints/hugginface/Meta-Llama-3-8B \
     --port 8087 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model /mnt/lvm/checkpoints/hugginface/EAGLE-LLaMA3-Instruct-8B \
     --num-speculative-tokens 1
  • accept rate on C4 dataset (input length < 1024, output length = 128)
INFO 12-19 05:31:51 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.315, System efficiency: 0.657, Number of speculative tokens: 1, Number of accepted tokens: 16182, Number of draft tokens: 51400, Number of emitted tokens: 67582.

@bettybaii

@xiongqisong
That’s great! It looks like this may be the root cause of the abnormal behavior of Eagle. I will further investigate the implementation of the original Eagle and the version of Eagle in vllm based on your findings.

@sroy745
Collaborator

sroy745 commented Dec 19, 2024

Hi @llsj14 thanks for the detailed note. I will add this to the documentation shortly. @abhigoyal1997 wondering if we can add your script to the vLLM repo? That way it might be slightly easier to make and track changes?

@xiongqisong / @bettybaii as suggested by @llsj14, wondering if we can use this #9565 to track vLLM's Eagle performance? This tracks some of the investigation done till now.

@xiongqisong
Author

Hi @llsj14 thanks for the detailed note. I will add this to the documentation shortly. @abhigoyal1997 wondering if we can add your script to the vLLM repo? That way it might be slightly easier to make and track changes?

@xiongqisong / @bettybaii as suggested by @llsj14, wondering if we can use this #9565 to track vLLM's Eagle performance? This tracks some of the investigation done till now.

OK, we can discuss the problems of the EAGLE implementation on vLLM in #9565.

@xiongqisong
Author

@xiongqisong I believe this is a great finding. Perhaps we can discuss it further in a new issue. I would also like to delve deeper into the implementation details of EAGLE on vLLM. Could you create another issue so we can discuss this further?

Glad to help; let's discuss related questions in issue #9565.

@xiongqisong
Author

@xiongqisong That’s great! It looks like this may be the root cause of the abnormal behavior of Eagle. I will further investigate the implementation of the original Eagle and the version of Eagle in vllm based on your findings.

Glad to help. I'm modifying the vLLM code to make EAGLE run correctly.
