[Usage]: how to use EAGLE on vLLM? #11126

Open
xiongqisong opened this issue Dec 12, 2024 · 27 comments
Labels
usage How to use vllm

Comments

@xiongqisong

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I want to test EAGLE on vLLM, but I have tried many ways to run it and failed every time.
The target model is Llama-2-chat-hf, and the draft model is EAGLE-llama2-chat from the original EAGLE author's GitHub.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@xiongqisong xiongqisong added the usage How to use vllm label Dec 12, 2024
@llsj14
Contributor

llsj14 commented Dec 16, 2024

I encountered the same problem.
I tried the following commands, but they resulted in errors related to loading EAGLE models.
After modifying the pt_weights_iterator function, I managed to load the models. However, the acceptance rate is very low (~0.068).

@sroy745 @LiuXiaoxuanPKU
Could you provide some guidance on running EAGLE models on vLLM? Should I convert the original EAGLE model? Any tips would be greatly appreciated!

Command

python -m vllm.entrypoints.openai.api_server \
     --model meta-llama/Llama-2-7b-chat-hf \
     --enable-prefix-caching \
     --port 8088 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model yuhuili/EAGLE-llama2-chat-7B \
     --num-speculative-tokens 2

python -m vllm.entrypoints.openai.api_server \
     --model meta-llama/Llama-3.1-70B-Instruct \
     --enable-prefix-caching \
     --port 8088 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model yuhuili/EAGLE-LLaMA3-Instruct-70B \
     --num-speculative-tokens 2 \
     -tp 4

Error message

    self.proposer_worker.load_model()
  File "/vllm/vllm/worker/worker.py", line 155, in load_model
    self.model_runner.load_model()
  File "/vllm/vllm/worker/model_runner.py", line 1093, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/vllm/vllm/model_executor/model_loader/loader.py", line 370, in load_model
    loaded_weights = model.load_weights(
  File "/vllm/vllm/model_executor/models/llama.py", line 594, in load_weights
    return loader.load_weights(
  File "/vllm/vllm/model_executor/models/utils.py", line 242, in load_weights
    autoloaded_weights = set(self._load_module("", self.module, weights))
  File "/vllm/vllm/model_executor/models/utils.py", line 229, in _load_module
    raise ValueError(msg)
ValueError: There is no module or parameter named 'embed_tokens' in LlamaForCausalLM

What I tried

--- a/vllm/model_executor/model_loader/weight_utils.py
+++ b/vllm/model_executor/model_loader/weight_utils.py
@@ -414,6 +414,7 @@ def pt_weights_iterator(
     hf_weights_files: List[str]
 ) -> Generator[Tuple[str, torch.Tensor], None, None]:
     """Iterate over the weights in the model bin/pt files."""
+    #print("hf_weights_files: ", hf_weights_files)
     enable_tqdm = not torch.distributed.is_initialized(
     ) or torch.distributed.get_rank() == 0
     for bin_file in tqdm(
@@ -423,6 +424,21 @@ def pt_weights_iterator(
             bar_format=_BAR_FORMAT,
     ):
         state = torch.load(bin_file, map_location="cpu")
+
+        def transform_key(key: str) -> str:
+            """Transforms keys based on predefined rules."""
+            if key.startswith("embed_tokens"):
+                return "model." + key  # Add 'model.' prefix for 'embed_tokens'
+            if key.startswith("fc"):
+                return None  # Llama does not support fc layers
+            return key
+
+        state = {
+            new_key: weight_value
+            for weight_name, weight_value in state.items()
+            if (new_key := transform_key(weight_name)) is not None
+        }
+
+        str_values = list(state.keys())
+        print("Extracted str values:", str_values)
         yield from state.items()
         del state
         torch.cuda.empty_cache()

@xiongqisong
Author

I use VS Code to debug EAGLE on vLLM; here is my debug configuration:

{
    "name": "debug eagle latency",
    "type": "debugpy",
    "request": "launch",
    "program": "benchmarks/benchmark_latency.py",
    "console": "integratedTerminal",
    "args": [
        "--model",
        "/data/xqs/model/Llama-2-7b-chat-hf",
        "--speculative-model",
        "/data1/xqs/EAGLE-llama",
        "--num_speculative_tokens",
        "4",
        "--use-v2-block-manager",
        "--batch-size",
        "1",
        "--input-len",
        "1024",
        "--output-len",
        "128",
        "--max-model-len",
        "2048"
    ],
    "justMyCode": false
}

The config.json file of the draft model (/data1/xqs/EAGLE-llama) is:

{
	"model_type": "eagle",
	"model": {
		"architectures": ["LlamaForCausalLM"],
		"bos_token_id": 1,
		"eos_token_id": 2,
		"hidden_act": "silu",
		"hidden_size": 4096,
		"initializer_range": 0.02,
		"intermediate_size": 11008,
		"max_position_embeddings": 4096,
		"model_type": "eagle",
		"num_attention_heads": 32,
		"num_hidden_layers": 1,
		"num_key_value_heads": 32,
		"pretraining_tp": 1,
		"rms_norm_eps": 1e-06,
		"rope_scaling": null,
		"tie_word_embeddings": false,
		"torch_dtype": "float16",
		"transformers_version": "4.32.0.dev0",
		"use_cache": true,
		"vocab_size": 32000
	}
}

The files contained in the draft model folder are:

(screenshot of the folder contents omitted)

The conversion script that changes the EAGLE draft model weights into vLLM-format weights is:

import json

import torch
from safetensors.torch import load_file, save_file

ckpt = torch.load("/data/model/EAGLE-llama2-chat-7B/pytorch_model.bin")
ref_ckpt = load_file("/data/model/Llama-2-7b-chat-hf/model-00002-of-00002.safetensors")

ckpt['lm_head.weight'] = ref_ckpt['lm_head.weight']

save_file(ckpt, "/data1/xqs/EAGLE-llama/model.safetensors")

with open("/data/model/EAGLE-llama2-chat-7B/config.json") as rf:
    cfg = json.load(rf)

cfg = {"model_type": "eagle", "model": cfg}

with open("/data1/xqs/EAGLE-llama/config.json", "w") as wf:
    json.dump(cfg, wf)

# delete EAGLE-LLaMA3-Instruct-8B/pytorch_model.bin

The conversion script is adapted from https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d; the author added EAGLE support to vLLM, but didn't show a full end-to-end usage example, which is confusing.

The error I faced is:

[rank0]:   File "/data1/xqs/vllm/vllm/executor/gpu_executor.py", line 39, in _init_executor
[rank0]:     self.driver_worker.init_device()
[rank0]:   File "/data1/xqs/vllm/vllm/spec_decode/spec_decode_worker.py", line 307, in init_device
[rank0]:     self.proposer_worker.load_model()
[rank0]:   File "/data1/xqs/vllm/vllm/worker/worker.py", line 152, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/data1/xqs/vllm/vllm/worker/model_runner.py", line 1066, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/model_loader/loader.py", line 305, in load_model
[rank0]:     model.load_weights(self._get_all_weights(model_config, model))
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/models/llama.py", line 581, in load_weights
[rank0]:     loader.load_weights(
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/models/utils.py", line 230, in load_weights
[rank0]:     autoloaded_weights = list(self._load_module("", self.module, weights))
[rank0]:   File "/data1/xqs/vllm/vllm/model_executor/models/utils.py", line 219, in _load_module
[rank0]:     raise ValueError(msg)
[rank0]: ValueError: There is no module or parameter named 'embed_tokens' in LlamaForCausalLM

It seems like vLLM doesn't run the load_weights method of the EAGLE class in model_executor/models/eagle.py? That's weird.

@llsj14
Contributor

llsj14 commented Dec 16, 2024

In my case, after converting the EAGLE checkpoints using the script, I encountered the following two errors.
I only succeeded in running the Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B combination.
vLLM last commit: 69ba344

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B

  • TP: 1
ERROR 12-16 05:03:28 engine.py:366]   File "/mnt/lvm/sungjae/pr/vllm/vllm/model_executor/model_loader/loader.py", line 364, in load_model
ERROR 12-16 05:03:28 engine.py:366]     loaded_weights = model.load_weights(
ERROR 12-16 05:03:28 engine.py:366]   File "/mnt/lvm/sungjae/pr/vllm/vllm/model_executor/models/eagle.py", line 161, in load_weights
ERROR 12-16 05:03:28 engine.py:366]     raise ValueError("Found bias in the loaded weights "
ERROR 12-16 05:03:28 engine.py:366] ValueError: Found bias in the loaded weights but the model config doesn't have bias

Llama-3.1-70B-Instruct / EAGLE-LLaMA3-Instruct-70B

  • TP: 4
    self.worker = worker_class(*args, **kwargs)
  File "/mnt/lvm/sungjae/pr/vllm/vllm/spec_decode/spec_decode_worker.py", line 92, in create_spec_worker
    spec_decode_worker = SpecDecodeWorker.create_worker(
  File "/mnt/lvm/sungjae/pr/vllm/vllm/spec_decode/spec_decode_worker.py", line 182, in create_worker
    raise NotImplementedError(
NotImplementedError: EAGLE does not support TP > 1 yet

Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B

  • TP: 1
  • Number of speculative tokens: 2
  • No errors (Run successful)
  • However, the acceptance rate is very low.
  • I benchmarked using LLMPerf by sending random inputs from sonnet.txt.
INFO 12-16 05:51:52 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.086, System efficiency: 0.354, Number of speculative tokens: 2, Number of accepted tokens: 43521, Number of draft tokens: 508066, Number of emitted tokens: 269680.
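
A quick sanity check on how the numbers in this log line appear to relate to one another. The relationship below is inferred from the logged values, not taken from the vLLM source, and check_spec_metrics is only an illustrative helper:

def check_spec_metrics(num_accepted: int, num_draft: int, num_emitted: int, k: int) -> None:
    # Draft acceptance rate: fraction of drafted tokens the target model accepted.
    acceptance_rate = num_accepted / num_draft
    # Each scoring step drafts k tokens and can emit at most k + 1 tokens
    # (the k accepted drafts plus one token from the target model).
    max_emitted = (num_draft // k) * (k + 1)
    system_efficiency = num_emitted / max_emitted
    print(f"acceptance rate ~ {acceptance_rate:.3f}, system efficiency ~ {system_efficiency:.3f}")

# Numbers from the log line above (k = 2 speculative tokens):
check_spec_metrics(num_accepted=43521, num_draft=508066, num_emitted=269680, k=2)
# -> acceptance rate ~ 0.086, system efficiency ~ 0.354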

@llsj14
Contributor

llsj14 commented Dec 16, 2024

@abhigoyal1997
I am grateful for your Eagle checkpoint converter, which allowed me to successfully run the Llama 3 combinations as described above. Do you have any updates on your converters, and do you think the acceptance rate of the Llama 3 combinations is reasonable?

@xiongqisong
Author

(quoting llsj14's comment above)

I think you can run the target model and draft model with the original EAGLE GitHub code; if the speedup is normal (2~4x, as in the paper), the problem shouldn't be EAGLE but vLLM.
I have tried many target/draft model pairs with the original EAGLE code and they all work fine; it's just very hard to use EAGLE on vLLM.

@xiongqisong
Author

xiongqisong commented Dec 16, 2024

I think the code in vllm/transformers_utils/configs/eagle.py line 14 is wrong:
(screenshot omitted)
because HF transformers doesn't support a model type 'eagle', so we can't use AutoConfig to configure the EAGLE draft model.
My transformers version is 4.46.3, and I can use vLLM to run the target model (i.e. the original model), Llama-2-7b-chat-hf, on its own.
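
A minimal sketch of the failure mode described above, assuming the draft model's config.json keeps model_type "eagle" inside the nested model dict (the dict values below are only illustrative):

from transformers import AutoConfig

nested = {"model_type": "eagle", "hidden_size": 4096, "num_hidden_layers": 1, "vocab_size": 32000}

try:
    # Roughly what EAGLEConfig.__init__ does with the nested "model" dict;
    # "eagle" is not a registered transformers model type, so this raises.
    AutoConfig.for_model(**nested)
except ValueError as e:
    print("AutoConfig rejects 'eagle':", e)

# Workaround discussed later in this thread: treat the draft model as a llama config.
nested["model_type"] = "llama"
llama_cfg = AutoConfig.for_model(**nested)
print(type(llama_cfg).__name__)  # LlamaConfig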

@bettybaii

(quoting llsj14's earlier comment: the two load errors and the Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B results)

After converting the Eagle checkpoints using the script, I encountered the same error as you did. I only succeeded in running the Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B combination.

However, when using Llama-2-7b-chat-hf, I encountered the following error:
[rank0]: File "/mnt/baizhuoyan/vllm/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/mnt/baizhuoyan/vllm/vllm/model_executor/model_loader/loader.py", line 402, in load_model
[rank0]: model.load_weights(self._get_all_weights(model_config, model))
[rank0]: File "/mnt/baizhuoyan/vllm/vllm/model_executor/models/eagle.py", line 149, in load_weights
[rank0]: raise ValueError("Found bias in the loaded weights "
[rank0]: ValueError: Found bias in the loaded weights but the model config doesn't have bias

@xiongqisong
Author

(quoting llsj14's and bettybaii's comments above)

I think the EAGLE implementation in vLLM has some limitation that prevents running arbitrary models with EAGLE on vLLM; we need to find it by reading more of the vLLM code. Fighting!

@bettybaii

bettybaii commented Dec 16, 2024

(quoting the same exchange as above)

After adding "bias": true to the config.json file of EAGLE-llama2-chat-7B, Llama-2-7b-chat-hf was able to run normally. However, the acceptance rate still remains abnormally low.

@xiongqisong
Author

(quoting the exchange above, including bettybaii's note that adding "bias": true made Llama-2-7b-chat-hf run, though with a low acceptance rate)

Can you show your draft model's config.json after converting?

@bettybaii

(quoting the exchange above, ending with xiongqisong's request for the converted draft model's config.json)

config.json:
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "bias": true}}

@xiongqisong
Author

xiongqisong commented Dec 16, 2024

@bettybaii my config.json is the same as yours:

{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "bias": true}}

but I still can't even run vLLM, what a pity...

@xiongqisong
Author

When I run the same command, I get this error:

    speculative_config = SpeculativeConfig.maybe_create_spec_config(
  File "/data1/xqs/vllm/vllm/config.py", line 1363, in maybe_create_spec_config
    draft_model_config = ModelConfig(
  File "/data1/xqs/vllm/vllm/config.py", line 210, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/data1/xqs/vllm/vllm/transformers_utils/config.py", line 192, in get_config
    config = config_class.from_pretrained(
  File "/data1/xqs/vllm/vllm/transformers_utils/configs/eagle.py", line 55, in from_pretrained
    return cls.from_dict(config_dict, **kwargs)
  File "/data1/xqs/miniconda3/envs/vllm-eagle/lib/python3.10/site-packages/transformers/configuration_utils.py", line 718, in from_dict
    config = cls(**config_dict)
  File "/data1/xqs/vllm/vllm/transformers_utils/configs/eagle.py", line 16, in __init__
    model_config = None if model is None else (AutoConfig.for_model(
  File "/data1/xqs/miniconda3/envs/vllm-eagle/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 912, in for_model
    raise ValueError(
ValueError: Unrecognized model identifier: eagle. Should contain one of albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth

@bettybaii

bettybaii commented Dec 16, 2024

(quoting xiongqisong's config.json and the "Unrecognized model identifier: eagle" error above)

That’s strange… I haven’t encountered this error before, and I haven’t made any modifications to the Eagle-related code. Could you try pinpointing the cause in more detail?
Here is the command I used:

python ./benchmarks/benchmark_throughput.py  \
--backend vllm \
--tensor-parallel-size 2 \
--dataset /mnt/sharegpt_gpt4.jsonl \
--model /mnt/data/models/huggingface-models/Llama-2-7b-chat-hf/ \
--speculative-model /mnt/data/models/huggingface-models/speculative-models/EAGLE-llama2-chat-7B/ \
--speculative-draft-tensor-parallel-size 1 \
--num_speculative_tokens 3 \
--enforce-eager \
--num-prompts 40

@xiongqisong
Author

(quoting the error above and bettybaii's benchmark_throughput command)

The point is here: vLLM uses HF's AutoConfig to initialize EAGLEConfig, but the model_type is 'eagle', and HF's transformers library doesn't support an 'eagle' model:
(screenshot omitted)

@bettybaii

bettybaii commented Dec 16, 2024

(quoting xiongqisong's point above about AutoConfig and the 'eagle' model_type)

I haven’t encountered this issue; could it be related to the environment? My transformers version is 4.45.2.
After successfully running the code, I used print("EAGLEConfig", model_config) to output the result of model_config, which was as follows:

EAGLEConfig LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bias": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 1,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.45.2",
  "use_cache": true,
  "vocab_size": 32000
}

@xiongqisong
Author

xiongqisong commented Dec 17, 2024

@bettybaii
After pulling the latest vLLM code, downgrading the transformers version (to the same one as yours), and changing the code:

--- a/vllm/transformers_utils/configs/eagle.py
+++ b/vllm/transformers_utils/configs/eagle.py
@@ -11,7 +11,8 @@ class EAGLEConfig(PretrainedConfig):
                  model: Union[PretrainedConfig, dict, None] = None,
                  truncated_vocab_size: Optional[int] = None,
                  **kwargs):
-
+        if model is not None:
+            model['model_type']='llama'
         model_config = None if model is None else (AutoConfig.for_model(
             **model) if isinstance(model, dict) else model)

and disabling torch dynamo compilation, I can run EAGLE on vLLM with Llama-2:

import torch._dynamo
torch._dynamo.config.suppress_errors = True

If I don't disable torch dynamo compilation, the propose stage crashes; I haven't figured out the reason yet.
I also changed the draft model's config.json because of the bias problem; the newer vLLM version renamed the config key:

{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "eagle", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "eagle_fc_bias": true}}

I added 'eagle_fc_bias' at the end.
I find the draft acceptance rate is 0.418, which is low. I think the reason is mentioned at #9565 (comment): EAGLE doesn't just reuse one decoder layer like the target model, it has different pre-processing / mid-layer processing than the original decoder layer. I think the fix hasn't been merged into the main branch yet, so the current code can't run EAGLE correctly without changes.
The special handling in EAGLE is (see the sketch after this list):

  • after appending the token generated by the target model, cut the first token of the sequence off, embed the new sequence, and combine the embedding with the hidden state (computed by the target model) as input to EAGLE
  • don't apply the input layernorm in the decoder layer
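
A rough sketch (not vLLM's actual code) of the input preparation described in the two bullets above; build_eagle_inputs is an illustrative helper, and the sizes are borrowed from the Llama-2-7B configs shown earlier in this thread:

import torch
import torch.nn as nn

hidden_size, vocab_size = 4096, 32000
embed_tokens = nn.Embedding(vocab_size, hidden_size)      # shared with the target model
fc = nn.Linear(2 * hidden_size, hidden_size, bias=True)   # EAGLE's fusion layer

def build_eagle_inputs(input_ids: torch.Tensor,
                       target_hidden_states: torch.Tensor,
                       new_token: torch.Tensor) -> torch.Tensor:
    """input_ids: [seq], target_hidden_states: [seq, hidden], new_token: [1]."""
    # 1. Append the token the target model just generated, then drop the first token
    #    of the sequence so positions line up with the target hidden states.
    shifted_ids = torch.cat([input_ids, new_token])[1:]                # still [seq]
    # 2. Embed the shifted sequence and fuse it with the target model's hidden states.
    embeds = embed_tokens(shifted_ids)                                 # [seq, hidden]
    fused = fc(torch.cat([embeds, target_hidden_states], dim=-1))      # [seq, hidden]
    # 3. `fused` goes straight into EAGLE's single decoder layer,
    #    skipping the usual input RMSNorm.
    return fused

# Toy usage:
out = build_eagle_inputs(torch.randint(0, vocab_size, (8,)),
                         torch.randn(8, hidden_size),
                         torch.randint(0, vocab_size, (1,)))
print(out.shape)  # torch.Size([8, 4096])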

@bettybaii

@xiongqisong
Since I’m not very familiar with Eagle, I’m wondering if the special handling of EAGLE you mentioned is easy to modify within vllm? In addition, I noticed that after conversion, the draft model’s parameter size increased significantly (from 1.55GB to 3.4048GB), which resulted in a substantial increase in GPU memory consumption and considerably extended the computation time (with the average_time_per_proposal_tok_ms reaching nearly 4 ms). Are you interested in addressing the issue of efficiently using Eagle in vllm? We could potentially work together on this.

@xiongqisong
Author

(quoting bettybaii's comment above)

I'm familiar with EAGLE: I have implemented chatglm/bluelm on the original EAGLE GitHub and discussed some problems with EAGLE's author (liyuhui). I'm glad to help address the issue; although I'm not very familiar with the vLLM code, I can learn it and modify it through this work.

Env and observed phenomena

latest vLLM code
machine: V100, 1 GPU, 32GB memory
target model: Llama-2-7b-chat-hf
draft model: https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B, which is from yuhuili's GitHub
draft model parameters: 0.24B (theoretically consumes 0.96GB of GPU memory in fp32, 0.48GB in bf16)
loaded draft model weights: 0.9278GB (I think this is normal)
loaded target model weights: 12.53GB (I think this is normal, mixed precision)
average_time_per_proposal_tok_ms: 3.11ms (this is weird)
scoring_time_ms: 26.33ms (this is weird too, but when I run the target model alone the time is the same, so maybe vLLM's statistics logic is wrong)
verification_time_ms: 1.4ms
speculative tokens: 4
draft acceptance rate: 0.421 (meaning 42.1%?)

clues

(screenshots omitted)

@xiongqisong
Author

@bettybaii @llsj14
So far I have found some problems in the EAGLE implementation in vLLM:

EAGLE model's architecture is wrong, with 5 unnecessary modules

The right architecture from the original EAGLE GitHub is:
(screenshot omitted)
The wrong architecture in vLLM is:

EAGLE(
(model): LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): VocabParallelEmbedding(num_embeddings=32000, embedding_dim=4096, org_vocab_size=32000, num_embeddings_padded=32000, tp_size=1)
(layers): ModuleList(
(0): LlamaDecoderLayer(
(self_attn): LlamaAttention(
(qkv_proj): QKVParallelLinear(in_features=4096, output_features=12288, bias=False, tp_size=1, gather_output=False)
(o_proj): RowParallelLinear(input_features=4096, output_features=4096, bias=False, tp_size=1, reduce_results=True)
(rotary_emb): RotaryEmbedding(head_size=128, rotary_dim=128, max_position_embeddings=4096, base=10000.0, is_neox_style=True)
(attn): Attention(head_size=128, num_heads=32, num_kv_heads=32, scale=0.08838834764831845, backend=XFormersImpl)
)
(mlp): LlamaMLP(
(gate_up_proj): MergedColumnParallelLinear(in_features=4096, output_features=22016, bias=False, tp_size=1, gather_output=False)
(down_proj): RowParallelLinear(input_features=11008, output_features=4096, bias=False, tp_size=1, reduce_results=True)
(act_fn): SiluAndMul()
)
(input_layernorm): RMSNorm(hidden_size=4096, eps=1e-06)
(post_attention_layernorm): RMSNorm(hidden_size=4096, eps=1e-06)
)
)
(norm): RMSNorm(hidden_size=4096, eps=1e-06)
)
(lm_head): ParallelLMHead(num_embeddings=32000, embedding_dim=4096, org_vocab_size=32000, num_embeddings_padded=32000, tp_size=1)
(logits_processor): LogitsProcessor(vocab_size=32000, forg_vocab_size=32000, scale=1.0, logits_as_input=False)
(sampler): Sampler()
)
(fc): Linear(in_features=8192, out_features=4096, bias=True)
(lm_head): ParallelLMHead(num_embeddings=32000, embedding_dim=4096, org_vocab_size=32000, num_embeddings_padded=32000, tp_size=1)
(logits_processor): LogitsProcessor(vocab_size=32000, forg_vocab_size=32000, scale=1.0, logits_as_input=False)

PS: I can't upload an image that big, so I marked the unnecessary modules with strikethrough ("delete line") instead.

code errors

  • It doesn't use the token generated by the target model as input to the EAGLE draft model. The right way is to concatenate seq + the first generated token, then remove the first token (not the first generated token, but the first token of the seq), embed the new sequence, and concatenate the embedding with the hidden state (computed by the target model).
  • The input layernorm is not needed; EAGLE only reuses one decoder layer of the target model, not the whole model.
  • It doesn't forward EAGLE x times to generate a proposal tree, where x is the tree depth chosen by the user (see the loop sketch after this list).
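
A simplified, chain-shaped sketch of the multi-step drafting loop the last bullet refers to (real EAGLE expands a tree of proposals); draft_chain, eagle_layer, and lm_head are illustrative placeholders, not vLLM APIs:

import torch

def draft_chain(ids, hidden, next_token, depth, embed_tokens, fc, eagle_layer, lm_head):
    """ids: [N] tokens, hidden: [N, H] target hidden states, next_token: [1]."""
    proposals = []
    for _ in range(depth):
        # Same shift + fuse step as in the earlier sketch.
        shifted = torch.cat([ids[1:], next_token])
        fused = fc(torch.cat([embed_tokens(shifted), hidden], dim=-1))
        out = eagle_layer(fused)                                  # one decoder layer, no input norm
        draft = lm_head(out[-1]).argmax(dim=-1, keepdim=True)     # greedy draft token, shape [1]
        proposals.append(draft)
        ids = torch.cat([ids, next_token])                        # grow the context
        hidden = torch.cat([hidden, out[-1:]])                    # reuse EAGLE's own hidden state
        next_token = draft
    return torch.cat(proposals)

# Toy usage with stand-in modules (identity decoder layer, just for shape checking):
H, V = 4096, 32000
toks = draft_chain(torch.randint(0, V, (8,)), torch.randn(8, H), torch.randint(0, V, (1,)),
                   depth=4,
                   embed_tokens=torch.nn.Embedding(V, H),
                   fc=torch.nn.Linear(2 * H, H),
                   eagle_layer=torch.nn.Identity(),
                   lm_head=torch.nn.Linear(H, V))
print(toks.shape)  # torch.Size([4])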

Other errors

  • Even when no proposal tokens are accepted by the target model, the acceptance rate isn't 0, because vLLM counts the token generated by the target model in both the numerator and the denominator, which confuses users (see the example below).
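
For example, if the rate really is computed this way (this is just my reading of the behavior, not verified against vLLM's metrics code), a step that accepts none of its proposals would still report a non-zero rate:

# hypothetical illustration of the accounting described above, not vLLM code
k_draft, accepted, bonus = 4, 0, 1   # 4 proposals, none accepted, 1 token emitted by the target model
rate = (accepted + bonus) / (k_draft + bonus)
print(rate)                          # 0.2 instead of the expected 0.0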

@llsj14
Contributor

llsj14 commented Dec 19, 2024

@xiongqisong
I believe this is a great finding. Perhaps we can discuss it further in a new issue. I would also like to delve deeper into the implementation details of EAGLE on vLLM. Could you create another issue so we can discuss this further?

@llsj14
Contributor

llsj14 commented Dec 19, 2024

@xiongqisong @bettybaii
I also tested after reading your comments. It seems that the issue with running the EAGLE model was solely due to the missing "eagle_fc_bias" key, rather than any other configuration settings. (I suggested changes to the script.)

For the EAGLE-LLaMA3-Instruct-70B model, I had to add the option --speculative-draft-tensor-parallel-size 1 because vLLM does not yet support TP > 1 for the EAGLE model.

I have also added experimental results and settings based on your help.

@sroy745
I hope this summary helps with your documentation on EAGLE.

Llama-2-7b-chat-hf / EAGLE-llama2-chat-7B

  • eagle config.json
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 1, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-06, "rope_scaling": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.32.0.dev0", "use_cache": true, "vocab_size": 32000, "eagle_fc_bias": true}}
  • vLLM running script
#!/bin/bash
python -m vllm.entrypoints.openai.api_server \
     --model /.../Llama-2-7b-chat-hf \
     --port 8089 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model /.../EAGLE-llama2-chat-7B \
     --num-speculative-tokens 1
  • accept rate on C4 dataset (input length < 1024, output length = 128)
INFO 12-19 05:27:19 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.246, System efficiency: 0.623, Number of speculative tokens: 1, Number of accepted tokens: 12915, Number of draft tokens: 52434, Number of emitted tokens: 65349.
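
A quick way to sanity-check such a setup before looking at the speculative metrics is a plain request against vLLM's OpenAI-compatible completions endpoint on the same port (the prompt and sampling parameters below are just an example; the model value must match what was passed to --model):

import requests

resp = requests.post(
    "http://localhost:8089/v1/completions",
    json={
        "model": "/.../Llama-2-7b-chat-hf",  # same placeholder path as in the script above
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])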

Llama-3.1-70B-Instruct / EAGLE-LLaMA3-Instruct-70B

  • eagle config.json
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 128000, "eos_token_id": 128001, "hidden_act": "silu", "hidden_size": 8192, "initializer_range": 0.02, "intermediate_size": 28672, "max_position_embeddings": 2048, "model_type": "llama", "num_attention_heads": 64, "num_key_value_heads": 8, "num_hidden_layers": 1, "pad_token_id": 0, "rms_norm_eps": 1e-05, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.28.1", "use_cache": true, "vocab_size": 128256, "rope_theta": 500000.0, "bias": false}}
  • vLLM running script
#!/bin/bash
python -m vllm.entrypoints.openai.api_server \
     --model /.../Llama-3.1-70B-Instruct \
     --port 8089 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model /.../EAGLE-LLaMA3-Instruct-70B \
     --num-speculative-tokens 1 \
     -tp 4 \
     --speculative-draft-tensor-parallel-size 1
  • accept rate on C4 dataset (input length < 1024, output length = 128)
INFO 12-19 05:20:29 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.242, System efficiency: 0.621, Number of speculative tokens: 1, Number of accepted tokens: 12932, Number of draft tokens: 53445, Number of emitted tokens: 66377.

Meta-Llama-3-8B / EAGLE-LLaMA3-Instruct-8B

  • eagle config.json
{"model_type": "eagle", "model": {"architectures": ["LlamaForCausalLM"], "bos_token_id": 128000, "eos_token_id": 128001, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 2048, "model_type": "llama", "num_attention_heads": 32, "num_key_value_heads": 8, "num_hidden_layers": 1, "pad_token_id": 0, "rms_norm_eps": 1e-05, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.28.1", "use_cache": true, "vocab_size": 128256, "rope_theta": 500000.0, "bias": false}}
  • vLLM running script
python -m vllm.entrypoints.openai.api_server \
     --model /mnt/lvm/checkpoints/hugginface/Meta-Llama-3-8B \
     --port 8087 \
     --disable-custom-all-reduce \
     --swap-space 0 \
     --gpu-memory-utilization 0.9 \
     --speculative-model /mnt/lvm/checkpoints/hugginface/EAGLE-LLaMA3-Instruct-8B \
     --num-speculative-tokens 1
  • accept rate on C4 dataset (input length < 1024, output length = 128)
INFO 12-19 05:31:51 metrics.py:489] Speculative metrics: Draft acceptance rate: 0.315, System efficiency: 0.657, Number of speculative tokens: 1, Number of accepted tokens: 16182, Number of draft tokens: 51400, Number of emitted tokens: 67582.

@bettybaii

@xiongqisong
That’s great! It looks like this may be the root cause of the abnormal behavior of Eagle. I will further investigate the implementation of the original Eagle and the version of Eagle in vllm based on your findings.

@sroy745
Collaborator

sroy745 commented Dec 19, 2024

Hi @llsj14 thanks for the detailed note. I will add this to the documentation shortly. @abhigoyal1997 wondering if we can add your script to the vLLM repo? That way it might be slightly easier to make and track changes?

@xiongqisong / @bettybaii as suggested by @llsj14, wondering if we can use this #9565 to track vLLM's Eagle performance? This tracks some of the investigation done till now.

@xiongqisong
Author

Hi @llsj14 thanks for the detailed note. I will add this to the documentation shortly. @abhigoyal1997 wondering if we can add your script to the vLLM repo? That way it might be slightly easier to make and track changes?

@xiongqisong / @bettybaii as suggested by @llsj14, wondering if we can use this #9565 to track vLLM's Eagle performance? This tracks some of the investigation done till now.

OK, we can discuss the problems of the EAGLE implementation on vLLM in #9565.

@xiongqisong
Author

@xiongqisong I believe this is a great finding. Perhaps we can discuss it further in a new issue. I would also like to delve deeper into the implementation details of EAGLE on vLLM. Could you create another issue so we can discuss this further?

Glad to help; let's discuss related questions in issue #9565.

@xiongqisong
Author

@xiongqisong That’s great! It looks like this may be the root cause of the abnormal behavior of Eagle. I will further investigate the implementation of the original Eagle and the version of Eagle in vllm based on your findings.

Glad to help. I'm modifying the vLLM code to make EAGLE run correctly.
