[Dynamic Spec Decoding] Minor fix for disabling speculative decoding #5000
Conversation
Thanks for the fix!
@@ -276,7 +276,8 @@ def execute_model(
# If no spec tokens, call the proposer and scorer workers normally.
# Used for prefill.
Refine the comment to mention the auto-disable case?
LGTM, let's update the comment before merging
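One possible refinement of that comment (assumed wording, not necessarily what was merged):

```python
# If no spec tokens are generated -- either because this is a prefill step
# or because speculative decoding was auto-disabled for a large batch
# (running_queue_size >= speculative_disable_by_batch_size) -- call the
# proposer and scorer workers normally, bypassing the speculative flow.
```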
"ngram_prompt_lookup_max": 3, | ||
"speculative_disable_by_batch_size": 4 | ||
}]) | ||
@pytest.mark.parametrize("batch_size", [1, 2, 5, 8]) |
nit: suggest only [1, 5] to reduce test time
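For context, the config above enables ngram speculative decoding and auto-disables it once four or more requests are running. A minimal sketch of exercising the same knobs directly (the model name and exact keyword signatures are assumptions and may differ across vLLM versions):

```python
from vllm import LLM, SamplingParams

# Ngram-based speculative decoding that auto-disables itself when the
# running batch size reaches 4 (mirrors the test config above).
llm = LLM(
    model="JackFram/llama-160m",        # small test model (assumption)
    speculative_model="[ngram]",        # ngram proposer, no draft model
    num_speculative_tokens=5,           # assumed value; not shown in the diff
    ngram_prompt_lookup_max=3,
    speculative_disable_by_batch_size=4,
)

# With 8 concurrent prompts, the batch size exceeds the threshold, so this
# should take the non-speculative path after the fix.
outputs = llm.generate(["Hello, my name is"] * 8,
                       SamplingParams(max_tokens=32))
```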
Looks like the CI failure is unrelated and we should just merge this. cc @simon-mo
In the current implementation, even if `running_queue_size >= speculative_disable_by_batch_size`, execution still goes through the speculative decoding logic, which includes `get_spec_proposals` with k=0, `score_proposals`, rejection sampling, and creating the sampler output. This flow introduces extra overhead (especially the rejection sampling), which makes disabling speculative decoding slower than genuinely running without speculative decoding.

To fix this, we can simply reuse `_run_no_spec` so the SD flow is not touched at all (see the sketch below). This PR also adds a test to check correctness.

Concretely, for a batch size of 8 with 128 output tokens, TP=4, on Llama3-70B, the batch latency is:

Here, "without SD" means not using the SD flag at all; "disable SD" means using the SD flag but setting `speculative_disable_by_batch_size` smaller than the batch size, so speculative decoding is disabled. After the fix we are still slower than the no-SD case; this is caused by broadcasting the control flow, which will be fixed in future PRs.
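For illustration, here is a minimal sketch of the dispatch this description implies (simplified and assumed, not the exact vLLM source; names such as `disable_by_batch_size` and `skip_proposer` are illustrative):

```python
# Sketch of a SpecDecodeWorker.execute_model dispatch (assumed, simplified).
def execute_model(self, execute_model_req):
    # Speculation is auto-disabled once the running batch is large enough.
    disable_all_speculation = (
        execute_model_req.running_queue_size >= self.disable_by_batch_size
    )

    # If there are no spec tokens to score (prefill) or speculation is
    # auto-disabled, take the non-speculative path: no k=0 proposals, no
    # proposal scoring, and no rejection sampling.
    if execute_model_req.num_lookahead_slots == 0 or disable_all_speculation:
        return self._run_no_spec(
            execute_model_req,
            skip_proposer=disable_all_speculation,  # illustrative kwarg
        )

    # Otherwise, run the full speculative decoding step:
    # propose -> score -> rejection-sample -> build sampler output.
    return self._run_speculative_decoding_step(execute_model_req)
```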