[Speculative decoding] CUDA graph support #4295
Conversation
Thanks for the PR, Heeju! I will take a look today.
examples/offline_inference.py (Outdated)
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}"
These look like mistaken changes/invalid syntax - please restore this file
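For context, the quoted line above is missing its closing parenthesis. The restored file should look roughly like the stock generation loop in examples/offline_inference.py; the sketch below follows the upstream example, though the exact prompts and model name in the repository at this revision may differ:

```python
from vllm import LLM, SamplingParams

# Sketch of the stock offline inference example; prompts and model name
# are illustrative and may differ slightly from the actual file.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    # Note the closing parenthesis that the diff above dropped.
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```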
Thanks again @heeju-kim2 for the PR. So I ran spec decode locally with CUDA graphs enabled, and I find that it works. Are you misconfiguring spec decode? You can see what I saw by running the test below, which uses the run_greedy_equality_correctness_test helper from the existing spec decode e2e tests. By the way, if this works for you, you can update this PR to simply include this test, remove the other changes, and we can merge it. Thanks for contributing to vLLM :)

@pytest.mark.parametrize(
"common_llm_kwargs",
[{
# Required for spec decode.
"use_v2_block_manager": True,
}])
@pytest.mark.parametrize("per_test_common_llm_kwargs", [
{
# Identical models.
"model": "JackFram/llama-68m",
"speculative_model": "JackFram/llama-68m",
"num_speculative_tokens": 5,
},
{
# Distinct models.
"model": "JackFram/llama-160m",
"speculative_model": "JackFram/llama-68m",
"num_speculative_tokens": 5,
}
])
@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
@pytest.mark.parametrize("test_llm_kwargs", [{
"enforce_eager": False,
}])
@pytest.mark.parametrize("batch_size", [1, 32])
@pytest.mark.parametrize("output_len", [128])
@pytest.mark.parametrize("seed", [1])
def test_spec_decode_cuda_graph(baseline_llm_generator, test_llm_generator, batch_size, output_len):
run_greedy_equality_correctness_test(
baseline_llm_generator,
test_llm_generator,
batch_size,
max_output_len=output_len,
force_output_len=True,
) |
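For readers unfamiliar with the helper: run_greedy_equality_correctness_test is the utility used throughout the spec decode e2e tests, and it roughly generates greedily with both LLM fixtures and asserts token-level equality. The following is a conceptual sketch only, not the actual vLLM implementation; the prompt and fixture handling are illustrative assumptions:

```python
from vllm import SamplingParams


def greedy_equality_sketch(baseline_llm_generator, test_llm_generator,
                           batch_size, max_output_len, force_output_len=True):
    """Conceptual sketch of the greedy-equality check, not the real helper."""
    prompts = ["Hello, my name is"] * batch_size

    # Greedy sampling so both runs are deterministic and directly comparable.
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=max_output_len,
        ignore_eos=force_output_len,  # force full-length outputs
    )

    def generate_token_ids(llm_generator):
        # Each generator fixture yields a freshly constructed LLM instance.
        for llm in llm_generator:
            outputs = llm.generate(prompts, sampling_params)
            return [list(o.outputs[0].token_ids) for o in outputs]

    baseline_ids = generate_token_ids(baseline_llm_generator)
    test_ids = generate_token_ids(test_llm_generator)
    assert baseline_ids == test_ids
```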
This reverts commit 699830730fcea9827e895f66cc380ce707e5b65c.
Thanks for the PR @heeju-kim2!
Co-authored-by: Cade Daniel <[email protected]>
We add an end-to-end test that verifies speculative decoding works with CUDA graphs. Specifically, we ensure that a model produces the same output with speculative decoding enabled as with it disabled, when CUDA graphs are enabled.
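For anyone reproducing this outside the test harness, the configuration under test corresponds roughly to the offline usage below. This is a sketch only: model choices mirror the test above, and the engine arguments are the ones available in vLLM around the time of this PR:

```python
from vllm import LLM, SamplingParams

# Draft-model speculative decoding with CUDA graphs left on
# (enforce_eager=False is the default, shown here for emphasis).
llm = LLM(
    model="JackFram/llama-160m",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,  # required for spec decode at this point in time
    enforce_eager=False,        # keep CUDA graph capture enabled
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```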