
[V1] Move more control of kv cache initialization from model_executor to EngineCore #11960

Merged (22 commits) on Jan 17, 2025

Conversation

@heheda12345 (Collaborator) commented Jan 11, 2025

This PR changes the workflow of EngineCore._initialize_kv_caches to enable more flexible control over the kv cache format in the future.
It is split out from #11938 and is a preparation for #11382.
Original workflow:

num_gpu_blocks, _ = self.model_executor.determine_num_available_blocks()
self.model_executor.initialize(num_gpu_blocks)

New workflow:

# Get all kv cache tensors needed by the model
kv_cache_spec = self.model_executor.get_kv_cache_spec()

# Profile the peak memory usage of the model to determine how much
# memory can be allocated for the kv cache.
available_gpu_memory = self.model_executor.get_available_memory()

# Get the kv cache tensor size
kv_cache_config, num_gpu_blocks = get_kv_cache_config(
    vllm_config, kv_cache_spec, available_gpu_memory)

# Initialize the kv cache and warm up the execution
self.model_executor.initialize(kv_cache_config)

This PR introduces two new concepts (a minimal sketch of both follows this list):

  1. KVCacheSpec, a data structure that represents the kv cache needed by each attention layer; it is constructed by asking the model runner to analyze all Attention modules. More Spec types will be added in the future, e.g., SlidingWindowSpec and MLASpec.
  2. KVCacheConfig, a class that represents how to allocate the kv cache tensors. It is quite simple now (all tensors have the same size), but it may be extended to the following cases:
    1. tensors with different sizes, to support MLA & spec decode
    2. allocating a global buffer and making the kv_cache tensors point to different offsets, so that multiple types of layer can share the same memory pool.
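
For illustration, here is a minimal sketch of what these two data structures might look like. The class name FullAttentionSpec and the field names (block_size, num_kv_heads, tensors, groups, ...) are assumptions for this example, not necessarily the exact definitions in vllm/v1/kv_cache_interface.py:

from dataclasses import dataclass
from typing import Dict, List

import torch


@dataclass
class FullAttentionSpec:
    """Illustrative kv cache spec of one full-attention layer."""
    block_size: int      # tokens per kv cache block
    num_kv_heads: int
    head_size: int
    dtype: torch.dtype

    @property
    def page_size_bytes(self) -> int:
        # K and V tensors for one block of block_size tokens.
        dtype_size = torch.tensor([], dtype=self.dtype).element_size()
        return 2 * self.block_size * self.num_kv_heads * self.head_size * dtype_size

    @property
    def type_id(self) -> str:
        # Layers with the same type_id can be managed by the same policy.
        return f"full_attention_{self.block_size}_{self.page_size_bytes}"


# KVCacheSpec: layer_name -> spec of that layer's kv cache.
KVCacheSpec = Dict[str, FullAttentionSpec]


@dataclass
class KVCacheConfig:
    """Illustrative description of how to allocate the kv cache tensors."""
    # Size in bytes of the kv cache tensor of each layer (currently all equal).
    tensors: Dict[str, int]
    # Layer names grouped by identical spec; a single group when all layers
    # share the same spec.
    groups: List[List[str]]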

Signed-off-by: Chen Zhang <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@heheda12345 changed the title from "[V1] Move more control of kv cache initialization from model_executor to to EngineCore" to "[V1] Move more control of kv cache initialization from model_executor to EngineCore" on Jan 11, 2025
@comaniac self-assigned this on Jan 11, 2025
@comaniac (Collaborator) left a comment:

Overall LGTM. Most comments are style changes. One important thing: please add docstrings to all functions and files in the desired format:

def func_name(arg1, arg2):
    """What does this function do?
    
    Args:
        arg1: ...
        arg2: ...
    
    Returns:
        ... (skip if return None) ...
    """

return kv_cache_config, num_gpu_blocks


def is_same_key(kv_cache_spec: KVCacheSpec) -> bool:
Collaborator:

Can we try to make this name more informative?

Collaborator Author:

KVCacheSpecBase.key -> KVCacheSpecBase.type_id
is_same_key -> is_same_type

vllm/v1/utils.py
Comment on lines 143 to 147
def bind_kv_cache(
    ctx: Dict[str, Any],
    runner_kv_caches: List[torch.Tensor],
    kv_caches: Dict[str, torch.Tensor],
) -> None:
Collaborator:

Should this be in kv_cache_utils.py, since it is kv cache related?

Collaborator Author:

bind_kv_cache is called by GPUModelRunner. I think it is strange to let GPUModelRunner call a function inside core. These two parts should be independent.
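
For readers unfamiliar with the helper, a minimal sketch of what such a binding function could do, based on the signature above. The layer-name pattern, the _extract_layer_index helper, and the kv_cache attribute on the forward-context entries are assumptions for this example, not the PR's actual code:

import re
from typing import Any, Dict, List

import torch


def _extract_layer_index(layer_name: str) -> int:
    # Assumption: layer names look like "model.layers.<i>.self_attn.attn".
    match = re.search(r"\.layers\.(\d+)\.", layer_name)
    assert match is not None, f"Cannot parse a layer index from {layer_name!r}"
    return int(match.group(1))


def bind_kv_cache(
    ctx: Dict[str, Any],
    runner_kv_caches: List[torch.Tensor],
    kv_caches: Dict[str, torch.Tensor],
) -> None:
    """Make the allocated kv cache tensors visible in two places:

    1. runner_kv_caches: the model runner's per-layer list, ordered by
       layer index, used when the runner iterates over layers.
    2. ctx: the forward context (layer_name -> attention module), so each
       attention module can read/write its own kv cache during forward.
    """
    # Fill the runner's list in layer-index order.
    for layer_name in sorted(kv_caches, key=_extract_layer_index):
        runner_kv_caches.append(kv_caches[layer_name])

    # Point each attention module in the forward context at its tensor.
    for layer_name, kv_cache in kv_caches.items():
        ctx[layer_name].kv_cache = kv_cache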

Comment on lines 61 to 63
# [group_id][layer_name in the group]. One group containing all
# layer_names if the Spec for kv_cache of all layers are the same
Collaborator:

I don't understand [group_id][layer_name in the group]. What's group ID? It might be better to just show an example.

# A list of kv-cache groups. Each group includes a set of layers with
# the same kv-cache spec. For example: ...

Collaborator Author:

Comments updated. Does it read better now?
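
For reference, an example of the grouping structure being documented here (the layer names are hypothetical):

# All four layers share the same kv cache spec -> a single group.
kv_cache_groups = [
    ["model.layers.0.attn", "model.layers.1.attn",
     "model.layers.2.attn", "model.layers.3.attn"],
]

# If some layers later get a different spec (e.g. sliding window),
# they would form a second group.
kv_cache_groups = [
    ["model.layers.0.attn", "model.layers.2.attn"],  # full attention
    ["model.layers.1.attn", "model.layers.3.attn"],  # sliding window
]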

heheda12345 and others added 9 commits January 13, 2025 23:25
@heheda12345 (Collaborator Author):

@comaniac Thank you for the review. I've updated the code based on your suggestions. Can you take another look?

@comaniac (Collaborator) left a comment:

This should be the last batch of comments. Approving to unblock the PR first.

f"initializing the engine.")


def is_same_type(kv_cache_spec: KVCacheSpec) -> bool:
Collaborator:

I feel this function name is a bit unclear. It actually checks whether the "kv cache specs" of "all" layers are the same, so it should be informative about "spec", "all" and "same". Maybe "is_uniformed_kv_cache_type" or something like that would be better.

Collaborator Author:

Changed to is_kv_cache_type_uniform.
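
A minimal sketch of what the renamed check might look like, assuming KVCacheSpec maps layer names to spec objects that expose a type_id (per the rename discussed earlier):

def is_kv_cache_type_uniform(kv_cache_spec: KVCacheSpec) -> bool:
    """Whether all attention layers in the model share the same kv cache spec type."""
    type_ids = {layer_spec.type_id for layer_spec in kv_cache_spec.values()}
    return len(type_ids) == 1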

Comment on lines 405 to 406
def get_kv_cache_config(vllm_config: VllmConfig, kv_cache_spec: KVCacheSpec,
                        available_memory: int) -> Tuple[KVCacheConfig, int]:
Collaborator:

Again, this function name (and the callee _get_kv_cache_config_same_type) doesn't convey that it also returns the number of GPU blocks. It's weird to see config, num_blocks = get_kv_cache_config(...).

One possibility:

def get_kv_cache_config_and_available_blocks(...):
    check_enough_kv_cache_memory(...)

    # Later maybe you can introduce a registry when you have more policies.
    if is_uniformed_kv_cache_type(...):
        return _get_kv_cache_config_and_blocks_for_uniformed_type(...)
    return _get_kv_cache_config_and_blocks_for_xxx(...)

Collaborator Author:

Changed num_blocks to an attribute of KVCacheConfig
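
With num_blocks as an attribute, the call site from the PR description would change along these lines (illustrative):

# Before: the block count was returned alongside the config.
kv_cache_config, num_gpu_blocks = get_kv_cache_config(
    vllm_config, kv_cache_spec, available_gpu_memory)

# After: the block count travels with the config.
kv_cache_config = get_kv_cache_config(
    vllm_config, kv_cache_spec, available_gpu_memory)
num_gpu_blocks = kv_cache_config.num_blocks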

Comment on lines 118 to 120
kv_cache_specs = self.collective_rpc("get_kv_cache_spec")
assert all(lc == kv_cache_specs[0] for lc in kv_cache_specs)
return kv_cache_specs[0]
Collaborator:

How would this be extended later if you have different specs?

Collaborator Author:

It won't be extended. kv_cache_specs[i] is the spec for all layers of one GPU. Different TP GPUs always have the same spec, though the specs of individual layers inside one GPU can differ.
PP executors of different stages can have different specs.

vllm/v1/utils.py
Comment on lines 152 to 153
Bind kv_caches to the forward context and model_runner's kv_cache.
Args:
Collaborator:

Please elaborate more, because the concept of "binding" the kv cache is not familiar to most people. For example: bind what kv cache to what, and for what purpose?

vllm/v1/utils.py
Comment on lines 168 to 169
assert all(kv_caches[n] is kv_caches[layer_name]
           for n in layer_names[1:])
Collaborator:

Add a comment if you're going to extend this logic for xxx; otherwise it looks a bit weird.

Collaborator Author:

Changed to raise an error if multiple attention layers have the same layer index.
layer_name and layer_index are defined by the model implementation, not by this PR. There will be no layer_index conflicts in decoder-only models.
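
A sketch of the revised check, reusing the hypothetical _extract_layer_index helper from the bind_kv_cache sketch above (not the PR's actual code):

layer_indices = [_extract_layer_index(name) for name in kv_caches]
if len(set(layer_indices)) != len(layer_indices):
    raise ValueError(
        "Multiple attention layers map to the same layer index; "
        "this is not expected for decoder-only models.")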

@comaniac added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jan 16, 2025
@comaniac (Collaborator) left a comment:

LGTM. Please also sync with the main branch; the SPMD PR may have changes incompatible with this one.

Args:
kv_cache_spec (KVCacheSpec): The KVCacheSpec of the model

Returns:
Collaborator:

Indent

@heheda12345 (Collaborator Author) commented Jan 16, 2025

@comaniac Thanks for the review. I've updated this PR.

@comaniac enabled auto-merge (squash) on January 16, 2025 17:26
@heheda12345 (Collaborator Author):

Trying to fix the CI failure with #12138

@comaniac merged commit 69d765f into vllm-project:main on Jan 17, 2025
54 checks passed
Member:

This is not really v1-specific; I think we should just make it work for both v0 and v1 directly. I really hate code duplication: it makes later bugfixes and v0/v1-agnostic features difficult to develop.

Member:

this is heavy code duplication.

@@ -134,3 +141,48 @@ def shutdown(proc: multiprocessing.Process, input_path: str, output_path: str):
        socket_file = ipc_socket.replace("ipc://", "")
        if os and os.path.exists(socket_file):
            os.remove(socket_file)


def bind_kv_cache(
Member:

this is heavy code duplication, too.

        self._run_workers("compile_or_warm_up_model")

    def get_kv_cache_spec(self) -> KVCacheSpec:
Member:

This executor is not used. You should add it in vllm/executor/executor_base.py, and do not use _run_workers; use collective_rpc instead.
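
A sketch of the suggested direction, mirroring the collective_rpc pattern shown in the multiproc executor snippet earlier; the placement in ExecutorBase is the reviewer's suggestion, not code from this PR:

class ExecutorBase:
    # ... existing executor methods ...

    def get_kv_cache_spec(self) -> KVCacheSpec:
        """Ask every worker for its kv cache spec; all TP workers must agree."""
        kv_cache_specs = self.collective_rpc("get_kv_cache_spec")
        assert all(spec == kv_cache_specs[0] for spec in kv_cache_specs)
        return kv_cache_specs[0]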
