
[Speculative decoding] CUDA graph support #4295

Merged
23 commits merged into vllm-project:main on May 10, 2024

Conversation

heeju-kim2
Contributor

@heeju-kim2 heeju-kim2 commented Apr 23, 2024

We add an end-to-end test that verifies speculative decoding works with CUDA graphs. Specifically, we ensure that a model produces the same output with speculative decoding enabled as without, when CUDA graphs are enabled.
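For reference, a minimal sketch of the configuration being exercised here, using the same options that appear in the test later in this thread (speculative_model, num_speculative_tokens, use_v2_block_manager, and enforce_eager=False so CUDA graphs are captured). The prompt and output handling are illustrative only, not part of the PR:

from vllm import LLM, SamplingParams

# enforce_eager=False lets vLLM capture CUDA graphs instead of running eagerly.
llm = LLM(
    model="JackFram/llama-160m",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
    enforce_eager=False,
)

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

for output in llm.generate(prompts, sampling_params):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")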

@cadedaniel
Collaborator

cadedaniel commented Apr 23, 2024

Thanks for the PR Heeju!

I will take a look today.

@cadedaniel cadedaniel changed the title from "update CUDA graph w/ lookahead slot" to "[Speculative decoding] CUDA graph support" on Apr 23, 2024
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}"



Member

These look like mistaken changes/invalid syntax - please restore this file

@cadedaniel
Collaborator

Thanks again @heeju-kim2 for the PR.

So I ran spec decode locally with CUDA graphs enabled, and I find that it works. Could spec decode be misconfigured on your end?

You can see what I saw by running the test below, which uses run_greedy_equality_correctness_test from test_correctness.py.

By the way, if this works for you, you can update this PR to simply include this test, remove the other changes, and we can merge it. Thanks for contributing to vLLM :)

import pytest

# run_greedy_equality_correctness_test is defined in test_correctness.py.


@pytest.mark.parametrize(
    "common_llm_kwargs",
    [{
        # Required for spec decode.
        "use_v2_block_manager": True,
    }])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [
    {
        # Identical models.
        "model": "JackFram/llama-68m",
        "speculative_model": "JackFram/llama-68m",
        "num_speculative_tokens": 5,
    },
    {
        # Distinct models.
        "model": "JackFram/llama-160m",
        "speculative_model": "JackFram/llama-68m",
        "num_speculative_tokens": 5,
    },
])
@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
@pytest.mark.parametrize("test_llm_kwargs", [{
    "enforce_eager": False,
}])
@pytest.mark.parametrize("batch_size", [1, 32])
@pytest.mark.parametrize("output_len", [128])
@pytest.mark.parametrize("seed", [1])
def test_spec_decode_cuda_graph(baseline_llm_generator, test_llm_generator,
                                batch_size, output_len):
    run_greedy_equality_correctness_test(
        baseline_llm_generator,
        test_llm_generator,
        batch_size,
        max_output_len=output_len,
        force_output_len=True,
    )
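For context, a minimal, hypothetical sketch of the greedy-equality comparison such a helper performs is below; the real implementation lives in test_correctness.py, and the function name, prompts, and cleanup here are illustrative assumptions, not the actual vLLM helper.

from vllm import LLM, SamplingParams

def check_greedy_equality(baseline_kwargs, test_kwargs, prompts,
                          max_output_len):
    # Illustrative only: assert that two LLM configurations produce
    # identical greedy outputs. The real helper also handles fixtures,
    # batching, and forcing the output length.
    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=max_output_len)

    baseline_llm = LLM(**baseline_kwargs)
    baseline_outputs = baseline_llm.generate(prompts, sampling_params)
    del baseline_llm  # release the baseline engine before loading the test config

    test_llm = LLM(**test_kwargs)
    test_outputs = test_llm.generate(prompts, sampling_params)

    for base, test in zip(baseline_outputs, test_outputs):
        assert base.outputs[0].token_ids == test.outputs[0].token_ids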

@cadedaniel
Collaborator

Thanks for the PR @heeju-kim2 !

@cadedaniel cadedaniel enabled auto-merge (squash) May 10, 2024 05:24
@cadedaniel cadedaniel merged commit 2e7796f into vllm-project:main May 10, 2024
55 checks passed
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 19, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024