Question about chunked prefill #12145
Unanswered
wearegolden asked this question in Q&A
My understanding is that if I run my model with `enable_chunked_prefill=True` and `max_num_batched_tokens` set to the max length of my model, this is equivalent to running without chunked prefill, with decoding prioritized over prefills. So my assumption was that if I send requests one at a time, there would be nothing to prioritize, since there is only one request to handle at any given moment, and therefore enabling or disabling chunked prefill should give the same results. However, this was not the case.
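For reference, the comparison I'm describing looks roughly like this (a sketch; the model name and the length of 4096 are placeholders, not my actual values):

```python
from vllm import LLM, SamplingParams

# Sketch of the two configurations being compared. With
# max_num_batched_tokens equal to the model's max length, a full prompt
# fits in a single batch, which I expected to behave the same as having
# chunked prefill disabled.
llm_chunked = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,
    max_model_len=4096,
)

llm_plain = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=False,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.0, max_tokens=64)

# Requests are sent one at a time, so there is never more than one
# sequence for the scheduler to handle.
out_chunked = llm_chunked.generate(["Hello"], params)
out_plain = llm_plain.generate(["Hello"], params)
```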
I did some digging into the code, and I'm suspecting this part is the reason for the difference in results.
Here, `self.block_tables` results in an empty list when chunked prefill is disabled, and a non-empty list when it is enabled:

vllm/vllm/attention/backends/flash_attn.py, lines 437 to 444 in a6221a1
This further leads to the code paths diverging here, where the `block_table` argument passed to `flash_attn_varlen_func` differs:

vllm/vllm/attention/backends/flash_attn.py, lines 857 to 904 in a6221a1
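My current mental model (which may be wrong, hence the question) is that with a paged KV cache, each sequence's cache lives in fixed-size blocks scattered through a physical pool, and the block table maps logical block indices to physical ones; a later prefill chunk would then need the table to find the KV entries written by earlier chunks. A toy sketch of that mapping, not vLLM's actual code:

```python
# Toy model of a paged KV cache block table (illustrative only, not
# vLLM internals). The cache is a pool of fixed-size blocks; each
# sequence's block table maps logical block index -> physical block id.
BLOCK_SIZE = 4

def blocks_needed(num_tokens: int) -> int:
    """Number of cache blocks a sequence of num_tokens occupies."""
    return (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE

# A sequence that already prefilled 10 tokens in earlier chunks
# occupies ceil(10 / 4) = 3 physical blocks, e.g. in arbitrary order:
block_table = [7, 2, 5]

def physical_slot(token_idx: int, table: list[int]) -> tuple[int, int]:
    """Map a token position to (physical block id, offset in block)."""
    logical_block = token_idx // BLOCK_SIZE
    offset = token_idx % BLOCK_SIZE
    return table[logical_block], offset

# The next prefill chunk must attend to tokens 0..9, which live in
# scattered physical blocks -- so the attention kernel needs the table.
print(physical_slot(0, block_table))  # -> (7, 0)
print(physical_slot(9, block_table))  # -> (5, 1)
```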
Can anyone explain why chunked prefill needs `block_tables` in the prefill phase? Thank you in advance.