[Usage]: how to use EAGLE on vLLM? #11126
Comments
I encountered the same problem. @sroy745 @LiuXiaoxuanPKU

Command:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --port 8088 \
    --disable-custom-all-reduce \
    --swap-space 0 \
    --gpu-memory-utilization 0.9 \
    --speculative-model yuhuili/EAGLE-llama2-chat-7B \
    --num-speculative-tokens 2

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --enable-prefix-caching \
    --port 8088 \
    --disable-custom-all-reduce \
    --swap-space 0 \
    --gpu-memory-utilization 0.9 \
    --speculative-model yuhuili/EAGLE-LLaMA3-Instruct-70B \
    --num-speculative-tokens 2 \
    -tp 4

Error message:

    self.proposer_worker.load_model()
File "/vllm/vllm/worker/worker.py", line 155, in load_model
self.model_runner.load_model()
File "/vllm/vllm/worker/model_runner.py", line 1093, in load_model
self.model = get_model(vllm_config=self.vllm_config)
File "/vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
return loader.load_model(vllm_config=vllm_config)
File "/vllm/vllm/model_executor/model_loader/loader.py", line 370, in load_model
loaded_weights = model.load_weights(
File "/vllm/vllm/model_executor/models/llama.py", line 594, in load_weights
return loader.load_weights(
File "/vllm/vllm/model_executor/models/utils.py", line 242, in load_weights
autoloaded_weights = set(self._load_module("", self.module, weights))
File "/vllm/vllm/model_executor/models/utils.py", line 229, in _load_module
raise ValueError(msg)
ValueError: There is no module or parameter named 'embed_tokens' in LlamaForCausalLM

What I tried:

--- a/vllm/model_executor/model_loader/weight_utils.py
+++ b/vllm/model_executor/model_loader/weight_utils.py
@@ -414,6 +414,7 @@ def pt_weights_iterator(
     hf_weights_files: List[str]
 ) -> Generator[Tuple[str, torch.Tensor], None, None]:
     """Iterate over the weights in the model bin/pt files."""
+    # print("hf_weights_files: ", hf_weights_files)
     enable_tqdm = not torch.distributed.is_initialized(
     ) or torch.distributed.get_rank() == 0
     for bin_file in tqdm(
@@ -423,6 +424,21 @@ def pt_weights_iterator(
         bar_format=_BAR_FORMAT,
     ):
         state = torch.load(bin_file, map_location="cpu")
+
+        def transform_key(key: str) -> str:
+            """Transforms keys based on predefined rules."""
+            if key.startswith("embed_tokens"):
+                return "model." + key  # add 'model.' prefix for 'embed_tokens'
+            if key.startswith("fc"):
+                return None  # LlamaForCausalLM does not have fc layers
+            return key
+
+        state = {
+            new_key: weight_value
+            for weight_name, weight_value in state.items()
+            if (new_key := transform_key(weight_name)) is not None
+        }
+
+        str_values = list(state.keys())
+        print("Extracted str values:", str_values)
         yield from state.items()
         del state
         torch.cuda.empty_cache()
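As a side note, before patching the loader it may help to dump the draft checkpoint's key names and confirm which prefixes ('embed_tokens', 'fc', 'layers.', ...) are actually present. A minimal inspection sketch (the checkpoint path is illustrative, not the exact file used above):

```python
# Print every tensor name and shape in the EAGLE draft checkpoint.
# The path below is an example; point it at your local draft model file.
import torch

state = torch.load("EAGLE-llama2-chat-7B/pytorch_model.bin", map_location="cpu")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```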
I use VS Code to debug EAGLE on vLLM; here is my debug config:
The config.json file of the draft model (/data1/xqs/EAGLE-llama) is:
The files contained in the draft model folder are:
The convert script that changes the EAGLE draft model weights into vLLM-format weights is:
The conversion script is referenced from https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d. The author added EAGLE support to vLLM, but he didn't show a full usage example of EAGLE on vLLM, which is confusing. The error I faced is:
It seems like vLLM doesn't run the load_weights method of the EAGLE class (the code is in model_executor/models/eagle.py)? It's so weird~
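Since the gist itself is not reproduced here, below is only a rough sketch of the kind of key remapping such a conversion script performs. The paths and rules are assumptions based on the loader patch earlier in this thread, not the gist's actual code:

```python
# Hypothetical offline conversion of an EAGLE draft checkpoint into names the
# vLLM loader will accept. Paths and key rules are illustrative only.
import torch

src = "EAGLE-llama2-chat-7B/pytorch_model.bin"        # original EAGLE draft weights
dst = "EAGLE-llama2-chat-7B-vllm/pytorch_model.bin"   # converted output

state = torch.load(src, map_location="cpu")
converted = {}
for name, tensor in state.items():
    if name.startswith("embed_tokens"):
        converted["model." + name] = tensor   # prefix assumed from the patch above
    else:
        converted[name] = tensor              # keep everything else (incl. EAGLE's fc layer)

torch.save(converted, dst)
print(f"saved {len(converted)} tensors to {dst}")
```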
In my case, after converting the EAGLE checkpoints using the script, I encountered the following two errors.

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B:
ERROR 12-16 05:03:28 engine.py:366] File "/mnt/lvm/sungjae/pr/vllm/vllm/model_executor/model_loader/loader.py", line 364, in load_model
ERROR 12-16 05:03:28 engine.py:366] loaded_weights = model.load_weights(
ERROR 12-16 05:03:28 engine.py:366] File "/mnt/lvm/sungjae/pr/vllm/vllm/model_executor/models/eagle.py", line 161, in load_weights
ERROR 12-16 05:03:28 engine.py:366] raise ValueError("Found bias in the loaded weights "
ERROR 12-16 05:03:28 engine.py:366] ValueError: Found bias in the loaded weights but the model config doesn't have bias

Llama-3.1-70B-Instruct / EAGLE-LLaMA3-Instruct-70B:
self.worker = worker_class(*args, **kwargs)
File "/mnt/lvm/sungjae/pr/vllm/vllm/spec_decode/spec_decode_worker.py", line 92, in create_spec_worker
spec_decode_worker = SpecDecodeWorker.create_worker(
File "/mnt/lvm/sungjae/pr/vllm/vllm/spec_decode/spec_decode_worker.py", line 182, in create_worker
raise NotImplementedError(
NotImplementedError: EAGLE does not support TP > 1 yet

Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B:
@abhigoyal1997
I think you can run the target model and draft model with the original EAGLE GitHub code first; if the speedup is normal, like the 2~4x reported in the paper, then the problem shouldn't be EAGLE but vLLM.
After converting the EAGLE checkpoints using the script, I encountered the same error as you did. I only succeeded in running the Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B combination. However, when using Llama-2-7b-chat-hf, I encountered the following error:
I think the implementation of EAGLE on vLLM has some limitation that prevents us from running arbitrary models with EAGLE on vLLM; we need to find it by reading more vLLM code~ Fighting!
After adding
Can you show your config.json of the draft model after converting?
config.json:
@bettybaii my config.json is the same as yours:
but I still can't run vLLM, what a pity....
That’s strange… I haven’t encountered this error before, and I haven’t made any modifications to the Eagle-related code. Could you try pinpointing the cause in more detail?

python ./benchmarks/benchmark_throughput.py \
--backend vllm \
--tensor-parallel-size 2 \
--dataset /mnt/sharegpt_gpt4.jsonl \
--model /mnt/data/models/huggingface-models/Llama-2-7b-chat-hf/ \
--speculative-model /mnt/data/models/huggingface-models/speculative-models/EAGLE-llama2-chat-7B/ \
--speculative-draft-tensor-parallel-size 1 \
--num_speculative_tokens 3 \
--enforce-eager \
    --num-prompts 40
@bettybaii
With torch dynamo compile shut down, I can run EAGLE on vLLM based on Llama-2:
If I don't shut down torch dynamo compile, the propose stage will crash; I haven't figured out the reason yet.
I added 'eagle_fc_bias' at the end.
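For context, a small sketch of how one might set that flag in the converted draft model's config.json; the path and the top-level placement are my assumptions based on this thread, not something taken from vLLM docs:

```python
# Hypothetical: set the 'eagle_fc_bias' flag mentioned above in the converted
# draft model's config.json. Path and placement are illustrative assumptions.
import json

cfg_path = "EAGLE-llama2-chat-7B-vllm/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["eagle_fc_bias"] = True  # only if the draft model's fc layer actually has a bias

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```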
@xiongqisong
I'm familiar with EAGLE; I have implemented chatglm/bluelm on the original EAGLE GitHub and discussed some problems with EAGLE's author (Yuhui Li). I'm glad to help address the issue; although I'm not very familiar with the vLLM code, I can learn and modify it through this chance.
Env and observed phenomena: latest vLLM code
Clues:
@bettybaii @llsj14 The EAGLE model's architecture is wrong, with 5 unnecessary modules. The right architecture in the original EAGLE GitHub is: EAGLE(
PS: I can't upload too big an image, so I use strikethrough instead.
Code error:
Other error:
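To reproduce this kind of comparison without screenshots, a tiny sketch for dumping a model's immediate submodules (the helper name is mine; call it once on the model vLLM builds and once on the model from the original EAGLE repo):

```python
# Print the immediate submodules of a model so extra or missing modules
# (like the unnecessary ones mentioned above) are easy to spot.
import torch.nn as nn

def print_top_level(tag: str, model: nn.Module) -> None:
    for name, child in model.named_children():
        print(f"{tag}: {name} -> {type(child).__name__}")
```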
@xiongqisong
@xiongqisong @bettybaii For the EAGLE-LLaMA3-Instruct-70B model, I had to add the option --speculative-draft-tensor-parallel-size 1 because vLLM does not yet support TP > 1 for the EAGLE model. I have also added experimental results and settings based on your help. @sroy745

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B:
INFO 12-19 05:27:19 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.246, System efficiency: 0.623, Number of speculative tokens: 1, Number of accepted tokens: 12915, Number of draft tokens: 52434, Number of emitted tokens: 65349.

Llama-3.1-70B-Instruct / EAGLE-LLaMA3-Instruct-70B:
INFO 12-19 05:20:29 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.242, System efficiency: 0.621, Number of speculative tokens: 1, Number of accepted tokens: 12932, Number of draft tokens: 53445, Number of emitted tokens: 66377.

Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B:
INFO 12-19 05:31:51 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.315, System efficiency: 0.657, Number of speculative tokens: 1, Number of accepted tokens: 16182, Number of draft tokens: 51400, Number of emitted tokens: 67582.
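For reference, assuming the usual definitions (acceptance rate = accepted / draft tokens; system efficiency = emitted tokens over the maximum of (k + 1) per scoring step), the reported numbers are internally consistent. For the Meta-Llama-3-8B run: 16182 / 51400 ≈ 0.315, and with num_speculative_tokens = 1 the maximum emitted is 2 × 51400 = 102800, so 67582 / 102800 ≈ 0.657.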
@xiongqisong
Hi @llsj14, thanks for the detailed note. I will add this to the documentation shortly. @abhigoyal1997, wondering if we can add your script to the vLLM repo? That way it might be slightly easier to make and track changes. @xiongqisong / @bettybaii, as suggested by @llsj14, wondering if we can use #9565 to track vLLM's Eagle performance? It tracks some of the investigation done till now.
OK, we can discuss the problems of the EAGLE implementation on vLLM at #9565
Glad to help; let's discuss related questions in issue #9565
Glad to help; I'm modifying the vLLM code to make EAGLE run correctly.
Your current environment
How would you like to use vllm
I want to test EAGLE on vLLM, but I have tried so many methods to run EAGLE and failed so many times.
The target model is Llama2-chat-hf, and the draft model is EAGLE-Llama2-chat from the original EAGLE author's GitHub.
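For anyone landing here, a minimal offline-inference sketch along the lines discussed in the comments above. It assumes a vLLM version from around this time whose LLM constructor accepts speculative_model / num_speculative_tokens, and that the draft checkpoint has already been converted to the vLLM-compatible format; the paths are illustrative:

```python
# Minimal speculative-decoding sketch with an EAGLE draft model.
# Assumes the draft checkpoint was already converted as described in this
# thread; model paths are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    speculative_model="path/to/EAGLE-llama2-chat-7B-vllm",
    num_speculative_tokens=2,
    enforce_eager=True,  # several commenters needed to disable dynamo/compile
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```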