
[Core][Kernel][Misc] Support external swapper for vllm #8018

Open · zeroorhero wants to merge 8 commits into main from add-external-swapper
Conversation


@zeroorhero commented Aug 30, 2024

Hi,

In the previous version of vLLM, when GPU memory is insufficient,
the KV cache has to be swapped to CPU memory. Building on that CPU
path, we abstracted an external swapper interface and implemented a
local-file backend to store the KV cache. Other distributed storage
implementations may be added in the future. Adding an external swapper
greatly expands the space available for storing tokens' KV cache.

The specific design is as follows. The KV cache is stored in a
hierarchical structure with the following levels:

GPU ---> CPU ----> External Swapper.

Each level offers more storage space but also higher latency. Newly
generated KV cache is placed in the lowest-latency tier first; when
that tier runs out of space, blocks are swapped out to the next tier.
We keep the existing vLLM scheduling and executor strategies, and simply
abstract an external swapper interface under the cache engine through
which different swappers can be connected.
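To make the interface shape concrete, here is a minimal sketch in Python; the class and method names are illustrative assumptions, not the PR's actual code:

```python
from abc import ABC, abstractmethod
from typing import Dict, List

import torch


class ExternalSwapperBase(ABC):
    """Hypothetical sketch of an external swapper interface.

    One instance per cache engine; the cache engine routes swaps that
    do not fit in CPU memory to a concrete implementation.
    """

    @abstractmethod
    def swap_out(self, gpu_cache: List[torch.Tensor],
                 block_mapping: Dict[int, int]) -> None:
        """Copy GPU blocks (src id -> external dst id) to external storage."""

    @abstractmethod
    def swap_in(self, gpu_cache: List[torch.Tensor],
                block_mapping: Dict[int, int]) -> None:
        """Copy blocks (external src id -> GPU dst id) back to the GPU cache."""

    @abstractmethod
    def get_num_free_blocks(self) -> int:
        """Remaining capacity, so the scheduler can decide block placement."""
```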

Currently, a local-file external swapper is implemented; an external
swapper for Valkey (an RDMA version of Redis) will be implemented next.
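A file-backed implementation of the interface sketched above could look roughly like this. It is again hypothetical (one file per block under a root directory, staged through CPU memory); the PR's dedicated swap kernels work at a lower level:

```python
import os
from typing import Dict, List

import torch


class LocalFileSwapper(ExternalSwapperBase):
    """Sketch only: stores each swapped-out block as one file on disk.

    Assumes each per-layer KV tensor is indexed by block id in dim 0.
    """

    def __init__(self, root_dir: str, capacity: int):
        os.makedirs(root_dir, exist_ok=True)
        self.root_dir = root_dir
        self.capacity = capacity  # total number of external blocks

    def _path(self, block_id: int) -> str:
        return os.path.join(self.root_dir, f"block_{block_id}.pt")

    def swap_out(self, gpu_cache: List[torch.Tensor],
                 block_mapping: Dict[int, int]) -> None:
        for src, dst in block_mapping.items():
            # Stage through CPU memory, then persist to disk.
            torch.save([layer[src].cpu() for layer in gpu_cache],
                       self._path(dst))

    def swap_in(self, gpu_cache: List[torch.Tensor],
                block_mapping: Dict[int, int]) -> None:
        for src, dst in block_mapping.items():
            for layer, block in zip(gpu_cache, torch.load(self._path(src))):
                layer[dst].copy_(block)

    def get_num_free_blocks(self) -> int:
        # Naive bookkeeping: capacity minus block files already written.
        return self.capacity - len(os.listdir(self.root_dir))
```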

We also ran some benchmarks against the local-file implementation.
The test environment is four NVIDIA A10 GPUs and an NVMe disk.

  1. Kernel benchmark (benchmarks/kernels/benchmark_swap_blocks):
    The execution time of a single kernel increased by about 60%.
    Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds
    Avg. GPU->File time taken for swapping blocks: 0.026778271198272707 seconds
    Avg. File->GPU time taken for swapping blocks: 0.025919318199157715 seconds

  2. Online server benchmark
    Token throughput dropped very little, about 2%.
    2.1 Swap to CPU (997 swaps):
    Server:
    python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --tensor-parallel-size 4 --swap-space 40
    Client:
    python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
    [Image 1: benchmark_serving results for swap-to-CPU]

    2.2 Swap to File (996 swaps):
    Server:
    python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --swap-space 0 --external-swapper file:///root/test --external-swapper-space 40 --tensor-parallel-size 4
    Client:
    python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000

[Image 2: benchmark_serving results for swap-to-file]


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@noooop
Contributor

noooop commented Aug 30, 2024

pin_memory has a great impact on swapping blocks.

More specifically, in benchmarks/kernels/benchmark_swap_blocks.py:

+ from light_vllm.utils import is_pin_memory_available
+ pin_memory = is_pin_memory_available()

- dst = torch.zeros_like(src).cpu()
+ dst = torch.zeros_like(src, pin_memory=pin_memory, device="cpu")

Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds
16 ms is so weird.
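The effect of pinned memory is easy to reproduce with a standalone snippet (a sketch, independent of vLLM): pinned host memory lets the copy run as a direct DMA transfer instead of staging through pageable memory.

```python
import time

import torch


def avg_copy_seconds(dst: torch.Tensor, src: torch.Tensor,
                     iters: int = 50) -> float:
    """Average wall-clock seconds per GPU->CPU copy."""
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    return (time.time() - start) / iters


src = torch.randn(64, 1024, 1024, dtype=torch.float16, device="cuda")  # 128 MiB
pageable = torch.empty(src.shape, dtype=src.dtype, device="cpu")
pinned = torch.empty(src.shape, dtype=src.dtype, device="cpu",
                     pin_memory=True)

print(f"pageable: {avg_copy_seconds(pageable, src):.6f} s")
print(f"pinned:   {avg_copy_seconds(pinned, src):.6f} s")  # usually several x faster
```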

@zeroorhero
Author

> pin_memory has a great impact on swapping blocks.
> […]
> 16ms is so weird

@noooop Thank you very much for your suggestion. I will make the changes and run the relevant tests. Could you also help me find other reviewers to check whether this approach is feasible?

@noooop
Contributor

noooop commented Aug 30, 2024

How should I put it tactfully?

In my setup, DDR4-3600 32 GB x4 has only ≈20 GB/s of bandwidth (compare with the 4090's ≈1 TB/s), so CPU-memory swap is almost useless to me.
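A back-of-envelope calculation puts those numbers in context. The figures below are estimates, assuming Llama-2-7b in fp16 and vLLM's default 16-token blocks:

```python
# Cost of moving one KV-cache block at different bandwidths (estimates).
layers, hidden, dtype_bytes = 32, 4096, 2          # Llama-2-7b, fp16
kv_per_token = 2 * layers * hidden * dtype_bytes   # K and V: 512 KiB/token
block_bytes = 16 * kv_per_token                    # 8 MiB per 16-token block

for name, bandwidth in [("host DRAM ~20 GB/s", 20e9),
                        ("GPU HBM ~1 TB/s", 1e12)]:
    print(f"{name}: {block_bytes / bandwidth * 1e3:.3f} ms per block")
# host DRAM ~20 GB/s: 0.419 ms per block
# GPU HBM ~1 TB/s:    0.008 ms per block
```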

@noooop
Contributor

noooop commented Aug 30, 2024

Maybe the external swapper can be used by future async schedulers; it is too slow for the current synchronous scheduler.

@zeroorhero
Author

Adding a benchmark result.
2. Online server benchmark
2.3 Recompute (997 recomputes):
The recompute result is basically the same as the result of swapping to the CPU.
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode recompute --tensor-parallel-size 4
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000

[Image 3: benchmark_serving results for recompute]

@zeroorhero force-pushed the add-external-swapper branch 9 times, most recently from 28acb63 to 9009e50 on September 3, 2024 02:42
When we create the LLM engine, we support
specifying the type of the external swapper
and the size of its storage medium. Currently,
only local files are supported. Developers
can customize the URI format of the external
swapper; for example, the file format is
"file://path/to/directory".
Users can also add their own external swapper.

Change-Id: I7989e7aba32ad218629c067fc984c8744c25ab64
Signed-off-by: Changqi Lu <[email protected]>
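For illustration, the scheme-based selection this commit describes could be parsed with a helper along these lines (a hypothetical sketch using urlparse, not the PR's actual code):

```python
from urllib.parse import urlparse


def parse_external_swapper(uri: str) -> tuple:
    """Pick a swapper backend from an --external-swapper value."""
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        # netloc is empty for file:///... URIs; including it also accepts
        # the file://path/to/directory form from the commit message.
        return "file", parsed.netloc + parsed.path
    raise ValueError(f"unsupported external swapper scheme: {parsed.scheme!r}")


print(parse_external_swapper("file:///root/test"))  # ('file', '/root/test')
```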
@zeroorhero force-pushed the add-external-swapper branch 2 times, most recently from ff06ee6 to 2d264a2 on September 3, 2024 07:44
First, two methods, determine_num_external_available_blocks and
initialize_external_cache, are added to determine the number of
external blocks and perform initialization when the vLLM engine
starts.

Then, according to the scheduler's results, the blocks that need
to be swapped are divided into two types, CPU and External, and
the block swap operation is performed.

Change-Id: I155439e6a8af21ae2241c3eba892f82b7c03fcb2
Signed-off-by: Changqi Lu <[email protected]>
In the previous version of the scheduler, the blocks_to_swap_out
field in the returned result only held the mapping of GPU to CPU
block ids. After adding devices other than the CPU, different
devices must be distinguished, so device-related information is
added to this field.

At the same time, external-swapper-related methods are added to
the scheduler and BlockManager, with corresponding implementations.

Change-Id: Iaec7fd7df17fa99c06bfc94d6cb314f44bc04522
Signed-off-by: Changqi Lu <[email protected]>
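A sketch of what the device-tagged mapping could look like follows; the enum and tuple layout are illustrative assumptions, not the PR's exact types:

```python
from enum import Enum
from typing import Dict, List, Tuple


class Device(Enum):  # illustrative; vLLM defines its own device enum
    CPU = "cpu"
    EXTERNAL = "external"


# Before: blocks_to_swap_out mapped GPU block id -> CPU block id only.
# After (sketch): each entry also carries the destination device, so the
# cache engine can route copies to the CPU cache or the external swapper.
blocks_to_swap_out: List[Tuple[int, Device, int]] = [
    (3, Device.CPU, 17),      # GPU block 3 -> CPU block 17
    (4, Device.EXTERNAL, 5),  # GPU block 4 -> external block 5
]


def split_by_device(mapping: List[Tuple[int, Device, int]]
                    ) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Partition one scheduler result into per-device swap mappings."""
    cpu = {src: dst for src, dev, dst in mapping if dev is Device.CPU}
    ext = {src: dst for src, dev, dst in mapping if dev is Device.EXTERNAL}
    return cpu, ext
```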
The ExternalSwapperBase interface is extracted,
and users can implement their own ExternalSwapper.
Each ExternalSwapperBase corresponds one-to-one with a
cache_engine, and interactions between the cache_engine and the
external swapper are dispatched to the concrete implementation
class of the corresponding swapper.

Change-Id: I585520bda1298bcb32e501d5f090299e4a21e1ad
Signed-off-by: Changqi Lu <[email protected]>
Implement the external swapper for local files.

Change-Id: Ia57e8b62c68ea9f32cb03d29047131194fa245a8
Signed-off-by: Changqi Lu <[email protected]>
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Sep 3, 2024
Add swap_out_to_local_file and swap_in_from_local_file
kernels and their corresponding tests and benchmarks.

Change-Id: I222a819ee57a5604d5b1e9a61c5614d3208ef251
Signed-off-by: Changqi Lu <[email protected]>
Add num_cumulative_preemption and external_cache_usage_sys
indicators to the metrics.

Change-Id: Id7f9c811cb3874e989b596ef9dbaa8b91442f875
Signed-off-by: Changqi Lu <[email protected]>
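For illustration, with prometheus_client (which vLLM's metrics layer is built on) the two indicators could be declared roughly as follows; the actual names and label sets in the PR may differ:

```python
from prometheus_client import Counter, Gauge

# Illustrative declarations only.
num_cumulative_preemption = Counter(
    "vllm_num_cumulative_preemption",
    "Cumulative number of request preemptions (swap or recompute).")
external_cache_usage_sys = Gauge(
    "vllm_external_cache_usage_sys",
    "Fraction of external swapper blocks currently in use.")

# Updated alongside the existing GPU/CPU cache usage gauges, e.g.:
# external_cache_usage_sys.set(num_used_external / num_total_external)
```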
benchmark_throughput: support external swapper configuration.

Change-Id: Ib139ab8fb2c63aecf80a5251f855cdbb14b3da41
Signed-off-by: Changqi Lu <[email protected]>
@zeroorhero
Author

@DarkLight1337 @ywang96 @youkaichao Hi, I have implemented a simple external storage backend. Please help review the code.

mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zeroorhero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Nov 26, 2024
Labels: frontend, needs-rebase, ready