[Speculative decoding] CUDA graph support #4295
Conversation
Thanks for the PR, Heeju! I will take a look today.
examples/offline_inference.py (Outdated)
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}"
These look like mistaken changes/invalid syntax - please restore this file
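For context, the quoted line above is missing its closing parenthesis. The restored file should look roughly like the stock generation loop in examples/offline_inference.py; the sketch below follows the upstream example, though the exact prompts and model name in the repository at this revision may differ:

```python
from vllm import LLM, SamplingParams

# Sketch of the stock offline inference example; prompts and model name
# are illustrative and may differ slightly from the actual file.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    # Note the closing parenthesis that the diff above dropped.
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```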
Thanks again @heeju-kim2 for the PR. So I ran spec decode locally with CUDA graphs enabled, and I find that it works. Are you misconfiguring spec decode? You can see what I saw by running the test below, which uses the run_greedy_equality_correctness_test helper from the existing spec decode e2e tests. By the way, if this works for you, you can update this PR to simply include this test, remove the other changes, and we can merge it. Thanks for contributing to vLLM :)

@pytest.mark.parametrize(
"common_llm_kwargs",
[{
# Required for spec decode.
"use_v2_block_manager": True,
}])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [
{
# Identical models.
"model": "JackFram/llama-68m",
"speculative_model": "JackFram/llama-68m",
"num_speculative_tokens": 5,
},
{
# Distinct models.
"model": "JackFram/llama-160m",
"speculative_model": "JackFram/llama-68m",
"num_speculative_tokens": 5,
}
])
@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
@pytest.mark.parametrize("test_llm_kwargs", [{
"enforce_eager": False,
}])
@pytest.mark.parametrize("batch_size", [1, 32])
@pytest.mark.parametrize("output_len", [128])
@pytest.mark.parametrize("seed", [1])
def test_spec_decode_cuda_graph(baseline_llm_generator, test_llm_generator, batch_size, output_len):
run_greedy_equality_correctness_test(
baseline_llm_generator,
test_llm_generator,
batch_size,
max_output_len=output_len,
force_output_len=True,
) |
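For readers unfamiliar with the helper: run_greedy_equality_correctness_test is the utility used throughout the spec decode e2e tests, and it roughly generates greedily with both LLM fixtures and asserts token-level equality. The following is a conceptual sketch only, not the actual vLLM implementation; the prompt and fixture handling are illustrative assumptions:

```python
from vllm import SamplingParams


def greedy_equality_sketch(baseline_llm_generator, test_llm_generator,
                           batch_size, max_output_len, force_output_len=True):
    """Conceptual sketch of the greedy-equality check, not the real helper."""
    prompts = ["Hello, my name is"] * batch_size

    # Greedy sampling so both runs are deterministic and directly comparable.
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=max_output_len,
        ignore_eos=force_output_len,  # force full-length outputs
    )

    def generate_token_ids(llm_generator):
        # Each generator fixture yields a freshly constructed LLM instance.
        for llm in llm_generator:
            outputs = llm.generate(prompts, sampling_params)
            return [list(o.outputs[0].token_ids) for o in outputs]

    baseline_ids = generate_token_ids(baseline_llm_generator)
    test_ids = generate_token_ids(test_llm_generator)
    assert baseline_ids == test_ids
```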
This reverts commit 699830730fcea9827e895f66cc380ce707e5b65c.
Thanks for the PR @heeju-kim2!
Co-authored-by: Cade Daniel <[email protected]>
We add an end-to-end test that verifies speculative decoding works with CUDA graphs. Specifically, we ensure that a model produces the same output with speculative decoding enabled as with it disabled, when CUDA graphs are enabled.
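For anyone reproducing this outside the test harness, the configuration under test corresponds roughly to the offline usage below. This is a sketch only: model choices mirror the test above, and the engine arguments are the ones available in vLLM around the time of this PR:

```python
from vllm import LLM, SamplingParams

# Draft-model speculative decoding with CUDA graphs left on
# (enforce_eager=False is the default, shown here for emphasis).
llm = LLM(
    model="JackFram/llama-160m",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,  # required for spec decode at this point in time
    enforce_eager=False,        # keep CUDA graph capture enabled
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```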