
[Core][Kernel][Misc] Support external swapper for vllm #8018

Open · zeroorhero wants to merge 8 commits into main from add-external-swapper
Conversation


@zeroorhero commented Aug 30, 2024

Hi,

In the previous version of vLLM, when GPU memory is insufficient,
the KV cache has to be swapped to CPU memory. Building on that CPU
path, we abstracted an external swapper interface and implemented a
local-file backend to store the KV cache. Other distributed storage
implementations may be added in the future. Adding an external swapper
greatly expands the space available for storing tokens' KV cache.

The specific design is as follows. The KV cache is stored in a
hierarchical structure with the following levels:

GPU ---> CPU ----> External Swapper.

Each level offers more storage space but also higher latency. Newly
generated KV cache is placed in the lowest-latency tier first; when
that tier runs out of space, blocks are swapped out to the next tier.
We keep the existing vLLM scheduling and executor strategies, and simply
abstract an external swapper interface under the cache engine through
which different swappers can be connected.
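To make the interface shape concrete, here is a minimal sketch in Python; the class and method names are illustrative assumptions, not the PR's actual code:

```python
from abc import ABC, abstractmethod
from typing import Dict, List

import torch


class ExternalSwapperBase(ABC):
    """Hypothetical sketch of an external swapper interface.

    One instance per cache engine; the cache engine routes swaps that
    do not fit in CPU memory to a concrete implementation.
    """

    @abstractmethod
    def swap_out(self, gpu_cache: List[torch.Tensor],
                 block_mapping: Dict[int, int]) -> None:
        """Copy GPU blocks (src id -> external dst id) to external storage."""

    @abstractmethod
    def swap_in(self, gpu_cache: List[torch.Tensor],
                block_mapping: Dict[int, int]) -> None:
        """Copy blocks (external src id -> GPU dst id) back to the GPU cache."""

    @abstractmethod
    def get_num_free_blocks(self) -> int:
        """Remaining capacity, so the scheduler can decide block placement."""
```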

Currently, a local-file external swapper is implemented; an external
swapper for Valkey (an RDMA version of Redis) will be implemented next.
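A file-backed implementation of the interface sketched above could look roughly like this. It is again hypothetical (one file per block under a root directory, staged through CPU memory); the PR's dedicated swap kernels work at a lower level:

```python
import os
from typing import Dict, List

import torch


class LocalFileSwapper(ExternalSwapperBase):
    """Sketch only: stores each swapped-out block as one file on disk.

    Assumes each per-layer KV tensor is indexed by block id in dim 0.
    """

    def __init__(self, root_dir: str, capacity: int):
        os.makedirs(root_dir, exist_ok=True)
        self.root_dir = root_dir
        self.capacity = capacity  # total number of external blocks

    def _path(self, block_id: int) -> str:
        return os.path.join(self.root_dir, f"block_{block_id}.pt")

    def swap_out(self, gpu_cache: List[torch.Tensor],
                 block_mapping: Dict[int, int]) -> None:
        for src, dst in block_mapping.items():
            # Stage through CPU memory, then persist to disk.
            torch.save([layer[src].cpu() for layer in gpu_cache],
                       self._path(dst))

    def swap_in(self, gpu_cache: List[torch.Tensor],
                block_mapping: Dict[int, int]) -> None:
        for src, dst in block_mapping.items():
            for layer, block in zip(gpu_cache, torch.load(self._path(src))):
                layer[dst].copy_(block)

    def get_num_free_blocks(self) -> int:
        # Naive bookkeeping: capacity minus block files already written.
        return self.capacity - len(os.listdir(self.root_dir))
```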

We also ran some benchmarks against the local-file implementation.
The test environment is four NVIDIA A10 GPUs and an NVMe disk.

  1. Kernel benchmark (benchmarks/kernels/benchmark_swap_blocks):
    The execution time of a single kernel increased by about 60%.
    Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds
    Avg. GPU->File time taken for swapping blocks: 0.026778271198272707 seconds
    Avg. File->GPU time taken for swapping blocks: 0.025919318199157715 seconds

  2. Online server benchmark
    Token throughput dropped very little, about 2%.
    2.1 Swap to CPU (997 swaps):
    Server:
    python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --tensor-parallel-size 4 --swap-space 40
    Client:
    python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
    [Image 1: benchmark_serving results for swap-to-CPU]

    2.2 Swap to File (996 swaps):
    Server:
    python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --swap-space 0 --external-swapper file:///root/test --external-swapper-space 40 --tensor-parallel-size 4
    Client:
    python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000

[Image 2: benchmark_serving results for swap-to-file]


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@noooop
Contributor

noooop commented Aug 30, 2024

pin_memory has a great impact on swapping blocks.

More specifically, in benchmarks/kernels/benchmark_swap_blocks.py:

+ from light_vllm.utils import is_pin_memory_available
+ pin_memory = is_pin_memory_available()

- dst = torch.zeros_like(src).cpu()
+ dst = torch.zeros_like(src, pin_memory=pin_memory, device="cpu")

Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds
16 ms is so weird.
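The effect of pinned memory is easy to reproduce with a standalone snippet (a sketch, independent of vLLM): pinned host memory lets the copy run as a direct DMA transfer instead of staging through pageable memory.

```python
import time

import torch


def avg_copy_seconds(dst: torch.Tensor, src: torch.Tensor,
                     iters: int = 50) -> float:
    """Average wall-clock seconds per GPU->CPU copy."""
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    return (time.time() - start) / iters


src = torch.randn(64, 1024, 1024, dtype=torch.float16, device="cuda")  # 128 MiB
pageable = torch.empty(src.shape, dtype=src.dtype, device="cpu")
pinned = torch.empty(src.shape, dtype=src.dtype, device="cpu",
                     pin_memory=True)

print(f"pageable: {avg_copy_seconds(pageable, src):.6f} s")
print(f"pinned:   {avg_copy_seconds(pinned, src):.6f} s")  # usually several x faster
```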

@zeroorhero
Author

> pin_memory has a great impact on swapping blocks.
> […]
> 16ms is so weird

@noooop Thank you very much for your suggestion. I will make the changes and run the relevant tests. Could you also help me find other reviewers to check whether this approach is feasible?

@noooop
Contributor

noooop commented Aug 30, 2024

How should I put it tactfully?

In my setup, DDR4-3600 32 GB x4 has only ≈20 GB/s of bandwidth (compare with the 4090's ≈1 TB/s), so CPU-memory swap is almost useless to me.
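A back-of-envelope calculation puts those numbers in context. The figures below are estimates, assuming Llama-2-7b in fp16 and vLLM's default 16-token blocks:

```python
# Cost of moving one KV-cache block at different bandwidths (estimates).
layers, hidden, dtype_bytes = 32, 4096, 2          # Llama-2-7b, fp16
kv_per_token = 2 * layers * hidden * dtype_bytes   # K and V: 512 KiB/token
block_bytes = 16 * kv_per_token                    # 8 MiB per 16-token block

for name, bandwidth in [("host DRAM ~20 GB/s", 20e9),
                        ("GPU HBM ~1 TB/s", 1e12)]:
    print(f"{name}: {block_bytes / bandwidth * 1e3:.3f} ms per block")
# host DRAM ~20 GB/s: 0.419 ms per block
# GPU HBM ~1 TB/s:    0.008 ms per block
```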

@noooop
Contributor

noooop commented Aug 30, 2024

Maybe the external swapper can be used by future async schedulers; it is too slow for the current synchronous scheduler.

@zeroorhero
Author

Adding a benchmark result.
2. Online server benchmark
2.3 Recompute (997 recomputes):
The recompute result is basically the same as the result of swapping to the CPU.
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode recompute --tensor-parallel-size 4
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000

[Image 3: benchmark_serving results for recompute]

@zeroorhero force-pushed the add-external-swapper branch 9 times, most recently from 28acb63 to 9009e50 on September 3, 2024 02:42
When we create the LLM engine, we support
specifying the type of the external swapper
and the size of its storage medium. Currently,
only local files are supported. Developers
can customize the URI format of the external
swapper; for example, the file format is
"file://path/to/directory".
Users can also add their own external swapper.

Change-Id: I7989e7aba32ad218629c067fc984c8744c25ab64
Signed-off-by: Changqi Lu <[email protected]>
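For illustration, the scheme-based selection this commit describes could be parsed with a helper along these lines (a hypothetical sketch using urlparse, not the PR's actual code):

```python
from urllib.parse import urlparse


def parse_external_swapper(uri: str) -> tuple:
    """Pick a swapper backend from an --external-swapper value."""
    parsed = urlparse(uri)
    if parsed.scheme == "file":
        # netloc is empty for file:///... URIs; including it also accepts
        # the file://path/to/directory form from the commit message.
        return "file", parsed.netloc + parsed.path
    raise ValueError(f"unsupported external swapper scheme: {parsed.scheme!r}")


print(parse_external_swapper("file:///root/test"))  # ('file', '/root/test')
```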
@zeroorhero force-pushed the add-external-swapper branch 2 times, most recently from ff06ee6 to 2d264a2 on September 3, 2024 07:44
First, two methods, determine_num_external_available_blocks and
initialize_external_cache, are added to determine the number of
external blocks and perform initialization when the vLLM engine
starts.

Then, according to the scheduler's results, the blocks that need
to be swapped are divided into two types, CPU and External, and
the block swap operation is performed.

Change-Id: I155439e6a8af21ae2241c3eba892f82b7c03fcb2
Signed-off-by: Changqi Lu <[email protected]>
In the previous version of the scheduler, the blocks_to_swap_out
field in the returned result only held the mapping of GPU to CPU
block ids. After adding devices other than the CPU, different
devices must be distinguished, so device-related information is
added to this field.

At the same time, external-swapper-related methods are added to
the scheduler and BlockManager, with corresponding implementations.

Change-Id: Iaec7fd7df17fa99c06bfc94d6cb314f44bc04522
Signed-off-by: Changqi Lu <[email protected]>
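A sketch of what the device-tagged mapping could look like follows; the enum and tuple layout are illustrative assumptions, not the PR's exact types:

```python
from enum import Enum
from typing import Dict, List, Tuple


class Device(Enum):  # illustrative; vLLM defines its own device enum
    CPU = "cpu"
    EXTERNAL = "external"


# Before: blocks_to_swap_out mapped GPU block id -> CPU block id only.
# After (sketch): each entry also carries the destination device, so the
# cache engine can route copies to the CPU cache or the external swapper.
blocks_to_swap_out: List[Tuple[int, Device, int]] = [
    (3, Device.CPU, 17),      # GPU block 3 -> CPU block 17
    (4, Device.EXTERNAL, 5),  # GPU block 4 -> external block 5
]


def split_by_device(mapping: List[Tuple[int, Device, int]]
                    ) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Partition one scheduler result into per-device swap mappings."""
    cpu = {src: dst for src, dev, dst in mapping if dev is Device.CPU}
    ext = {src: dst for src, dev, dst in mapping if dev is Device.EXTERNAL}
    return cpu, ext
```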
The ExternalSwapperBase interface is extracted,
and users can implement their own ExternalSwapper.
Each ExternalSwapperBase corresponds one-to-one with a
cache_engine, and interactions between the cache_engine and the
external swapper are dispatched to the concrete implementation
class of the corresponding swapper.

Change-Id: I585520bda1298bcb32e501d5f090299e4a21e1ad
Signed-off-by: Changqi Lu <[email protected]>
Implement the external swapper for local files.

Change-Id: Ia57e8b62c68ea9f32cb03d29047131194fa245a8
Signed-off-by: Changqi Lu <[email protected]>
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Sep 3, 2024
Add swap_out_to_local_file and swap_in_from_local_file
kernels and their corresponding tests and benchmarks.

Change-Id: I222a819ee57a5604d5b1e9a61c5614d3208ef251
Signed-off-by: Changqi Lu <[email protected]>
Add num_cumulative_preemption and external_cache_usage_sys
indicators to the metrics.

Change-Id: Id7f9c811cb3874e989b596ef9dbaa8b91442f875
Signed-off-by: Changqi Lu <[email protected]>
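For illustration, with prometheus_client (which vLLM's metrics layer is built on) the two indicators could be declared roughly as follows; the actual names and label sets in the PR may differ:

```python
from prometheus_client import Counter, Gauge

# Illustrative declarations only.
num_cumulative_preemption = Counter(
    "vllm_num_cumulative_preemption",
    "Cumulative number of request preemptions (swap or recompute).")
external_cache_usage_sys = Gauge(
    "vllm_external_cache_usage_sys",
    "Fraction of external swapper blocks currently in use.")

# Updated alongside the existing GPU/CPU cache usage gauges, e.g.:
# external_cache_usage_sys.set(num_used_external / num_total_external)
```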
benchmark_throughput: support external swapper configuration.

Change-Id: Ib139ab8fb2c63aecf80a5251f855cdbb14b3da41
Signed-off-by: Changqi Lu <[email protected]>
@zeroorhero
Author

@DarkLight1337 @ywang96 @youkaichao Hi, I have implemented a simple external storage backend. Please help review the code.

mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zeroorhero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Nov 26, 2024
Labels: frontend, needs-rebase, ready