Question about chunked prefill #12145
Unanswered
wearegolden asked this question in Q&A
My understanding is that if I run my model with `enable_chunked_prefill=True` and `max_num_batched_tokens` set to the max length of my model, this is equivalent to running without chunked prefill, with decoding prioritized over prefills. So my assumption was that if I send requests one at a time, there would be nothing to prioritize, since there is only one request to handle at any given moment, and therefore enabling or disabling chunked prefill should give the same results. However, this was not the case.
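For reference, the comparison I'm describing looks roughly like this (a sketch; the model name and the length of 4096 are placeholders, not my actual values):

```python
from vllm import LLM, SamplingParams

# Sketch of the two configurations being compared. With
# max_num_batched_tokens equal to the model's max length, a full prompt
# fits in a single batch, which I expected to behave the same as having
# chunked prefill disabled.
llm_chunked = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,
    max_model_len=4096,
)

llm_plain = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=False,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.0, max_tokens=64)

# Requests are sent one at a time, so there is never more than one
# sequence for the scheduler to handle.
out_chunked = llm_chunked.generate(["Hello"], params)
out_plain = llm_plain.generate(["Hello"], params)
```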
I did some digging into the code, and I'm suspecting this part is the reason for the difference in results.
Here, `self.block_tables` results in an empty list when chunked prefill is disabled, and a non-empty list when it is enabled:

vllm/vllm/attention/backends/flash_attn.py, lines 437 to 444 in a6221a1
This further leads to the code paths diverging here, where the `block_table` argument passed to `flash_attn_varlen_func` differs:

vllm/vllm/attention/backends/flash_attn.py, lines 857 to 904 in a6221a1
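My current mental model (which may be wrong, hence the question) is that with a paged KV cache, each sequence's cache lives in fixed-size blocks scattered through a physical pool, and the block table maps logical block indices to physical ones; a later prefill chunk would then need the table to find the KV entries written by earlier chunks. A toy sketch of that mapping, not vLLM's actual code:

```python
# Toy model of a paged KV cache block table (illustrative only, not
# vLLM internals). The cache is a pool of fixed-size blocks; each
# sequence's block table maps logical block index -> physical block id.
BLOCK_SIZE = 4

def blocks_needed(num_tokens: int) -> int:
    """Number of cache blocks a sequence of num_tokens occupies."""
    return (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE

# A sequence that already prefilled 10 tokens in earlier chunks
# occupies ceil(10 / 4) = 3 physical blocks, e.g. in arbitrary order:
block_table = [7, 2, 5]

def physical_slot(token_idx: int, table: list[int]) -> tuple[int, int]:
    """Map a token position to (physical block id, offset in block)."""
    logical_block = token_idx // BLOCK_SIZE
    offset = token_idx % BLOCK_SIZE
    return table[logical_block], offset

# The next prefill chunk must attend to tokens 0..9, which live in
# scattered physical blocks -- so the attention kernel needs the table.
print(physical_slot(0, block_table))  # -> (7, 0)
print(physical_slot(9, block_table))  # -> (5, 1)
```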
Can anyone explain why chunked prefill needs `block_tables` in the prefill phase? Thank you in advance.