[Core][Kernel][Misc] Support external swapper for vllm #8018
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge). To run full CI, you can do one of these: add the ready label to the PR, or enable auto-merge. 🚀
pin_memory has a great impact on swapping blocks; more specifically, see benchmarks/kernels/benchmark_swap_blocks.py.
@noooop Thank you very much for your suggestion. I will make some changes and run the relevant tests. Could you also help me find other reviewers to check whether this approach is feasible?
Add a benchmark result.
When we create the LLM engine, we support specifying the type of external swapper and the size of its backing medium. Currently, only local files are supported. Developers can customize the external swapper's URI format; for example, a file swapper is specified as "file://path/to/directory". Users can also register their own external swappers. Change-Id: I7989e7aba32ad218629c067fc984c8744c25ab64 Signed-off-by: Changqi Lu <[email protected]>
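For illustration, a minimal sketch of what engine creation could look like with the external swapper enabled. The keyword names here just mirror the new CLI flags introduced in this PR (--external-swapper, --external-swapper-space) and are assumptions, not the final API:

```python
from vllm import EngineArgs, LLMEngine

engine_args = EngineArgs(
    model="/root/Llama-2-7b-hf/",
    # URI-style swapper spec: the scheme selects the backend, the path
    # selects the storage location. Only "file://" is supported so far.
    external_swapper="file:///root/test",
    # Size of the external medium reserved for swapped KV blocks, in GiB.
    external_swapper_space=40,
)
engine = LLMEngine.from_engine_args(engine_args)
```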
First, two methods, determine_num_external_available_blocks and initialize_external_cache, are added to determine the number of external blocks and perform initialization when the vLLM engine starts. Then, based on the scheduler's decisions, the blocks to be swapped are split into two kinds, CPU and external, and the swap operations are performed accordingly. Change-Id: I155439e6a8af21ae2241c3eba892f82b7c03fcb2 Signed-off-by: Changqi Lu <[email protected]>
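A rough sketch of the shape of these two worker-side hooks; the method names come from the commit message, while the bodies and helper attributes (block_size_bytes, external_swapper_space_bytes, external_swapper) are assumptions for illustration:

```python
class Worker:
    def determine_num_external_available_blocks(self) -> int:
        # How many KV-cache blocks fit into --external-swapper-space.
        return self.external_swapper_space_bytes // self.block_size_bytes

    def initialize_external_cache(self, num_external_blocks: int) -> None:
        # Pre-create and size the external backing store (e.g. the
        # local file) before the engine starts serving requests.
        self.external_swapper.allocate(num_external_blocks)
```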
In the previous version of the scheduler, the blocks_to_swap_out field in the scheduling result only held a GPU-to-CPU block-id mapping. With devices other than the CPU, different devices must be distinguished, so device information is added to this field. External-swapper-related methods are also added to the Scheduler and BlockManager, with corresponding implementations. Change-Id: Iaec7fd7df17fa99c06bfc94d6cb314f44bc04522 Signed-off-by: Changqi Lu <[email protected]>
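A sketch of what the device-tagged mapping might look like; the exact types in the PR may differ, this is only to show the idea:

```python
from enum import Enum

class SwapDevice(Enum):
    CPU = "cpu"
    EXTERNAL = "external"

# Each entry: (gpu_block_id, dst_block_id, destination device).
blocks_to_swap_out = [
    (12, 3, SwapDevice.CPU),       # CPU pool still has room
    (47, 8, SwapDevice.EXTERNAL),  # CPU pool full: spill to the file
]
```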
The ExternalSwapperBase interface is extracted so that users can implement their own ExternalSwapper. Each ExternalSwapperBase instance corresponds one-to-one with a cache_engine, and the interaction between the cache_engine and the external swapper is dispatched to the concrete implementation class of the corresponding external swapper. Change-Id: I585520bda1298bcb32e501d5f090299e4a21e1ad Signed-off-by: Changqi Lu <[email protected]>
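The contract could look roughly like this; method names and signatures are illustrative, not the PR's actual interface:

```python
from abc import ABC, abstractmethod
from typing import Dict

import torch

class ExternalSwapperBase(ABC):
    @abstractmethod
    def swap_out(self, gpu_cache: torch.Tensor,
                 src_to_dst: Dict[int, int]) -> None:
        """Copy the given GPU blocks out to external storage."""

    @abstractmethod
    def swap_in(self, gpu_cache: torch.Tensor,
                src_to_dst: Dict[int, int]) -> None:
        """Copy blocks back from external storage into the GPU cache."""
```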
Implement an external swapper backed by a local file. Change-Id: Ia57e8b62c68ea9f32cb03d29047131194fa245a8 Signed-off-by: Changqi Lu <[email protected]>
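An illustrative local-file backend built on the base class sketched above. The real implementation uses the dedicated kernels added later in this PR; this version stages blocks through host memory just to keep the sketch self-contained:

```python
class LocalFileSwapper(ExternalSwapperBase):
    def __init__(self, path: str, block_size_bytes: int):
        self.path = path  # e.g. parsed from "file:///root/test"
        self.block_size_bytes = block_size_bytes

    def swap_out(self, gpu_cache, src_to_dst):
        with open(self.path, "r+b") as f:
            for src, dst in src_to_dst.items():
                f.seek(dst * self.block_size_bytes)
                f.write(gpu_cache[src].cpu().contiguous().numpy().tobytes())

    def swap_in(self, gpu_cache, src_to_dst):
        with open(self.path, "rb") as f:
            for src, dst in src_to_dst.items():
                f.seek(src * self.block_size_bytes)
                raw = bytearray(f.read(self.block_size_bytes))
                block = torch.frombuffer(raw, dtype=gpu_cache.dtype)
                gpu_cache[dst].copy_(block.view(gpu_cache[dst].shape))
```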
Add swap_out_to_local_file and swap_in_from_local_file kernels and their corresponding tests and benchmarks. Change-Id: I222a819ee57a5604d5b1e9a61c5614d3208ef251 Signed-off-by: Changqi Lu <[email protected]>
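An illustrative timing harness in the spirit of benchmarks/kernels/benchmark_swap_blocks.py; the GPU->File entry point in the commented usage is an assumption based on the kernel name in this commit:

```python
import time

import torch

def time_op(fn, iters: int = 100) -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()  # include in-flight GPU work in the timing
    return (time.perf_counter() - start) / iters

# avg = time_op(lambda: swap_out_to_local_file(gpu_cache, fd, mapping))
# print(f"Avg. GPU->File time taken for swapping blocks: {avg} seconds")
```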
Add the num_cumulative_preemption and external_cache_usage_sys indicators to metrics. Change-Id: Id7f9c811cb3874e989b596ef9dbaa8b91442f875 Signed-off-by: Changqi Lu <[email protected]>
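For reference, the two indicators could be surfaced like other vLLM metrics via prometheus_client; the names come from the commit message, but the exact registration in the PR may differ:

```python
from prometheus_client import Counter, Gauge

num_cumulative_preemption = Counter(
    "vllm:num_cumulative_preemption",
    "Cumulative number of request preemptions.")
external_cache_usage_sys = Gauge(
    "vllm:external_cache_usage_sys",
    "Fraction of external swapper KV-cache blocks currently in use.")
```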
benchmark_throughput now supports external swapper configuration. Change-Id: Ib139ab8fb2c63aecf80a5251f855cdbb14b3da41 Signed-off-by: Changqi Lu <[email protected]>
@DarkLight1337 @ywang96 @youkaichao Hi, I have implemented a simple external storage backend. Please help review the code.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi,
In previous versions of vLLM, when GPU memory is insufficient, the KV cache has to be swapped out to CPU memory. Building on that path, we abstract an external swapper interface and implement a local-file backend for storing the KV cache; other distributed storage backends may be added in the future. The external swapper greatly expands the storage space available for KV-cache tokens.
The specific design is as follows:
The KV cache is stored in a hierarchical structure:
GPU ---> CPU ---> External Swapper
Each level offers more storage space but also higher latency. Newly generated KV cache is placed in the lowest-latency tier with free space; when that tier fills up, blocks are swapped out to the next tier.
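As a toy illustration of that tiering policy (the names are ours for the sketch, not the scheduler's actual code):

```python
# Prefer the lowest-latency tier that still has room, spilling outward
# as each tier fills; purely illustrative.
def pick_swap_target(num_blocks: int, cpu_free: int, external_free: int):
    if cpu_free >= num_blocks:
        return "cpu"        # fastest tier with room
    if external_free >= num_blocks:
        return "external"   # slower, but much larger
    return None             # nowhere to swap; fall back to recompute
```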
We keep the existing vLLM scheduling and executor strategies, and simply abstract an external swapper interface under the cache engine to connect different swappers.
Currently, the local-file external swapper is implemented; an external swapper for valkey (an RDMA version of Redis) will be implemented next.
We also ran some benchmarks on the local-file implementation. Our test environment is 4x NVIDIA A10 GPUs with an NVMe disk.
Kernel benchmark (benchmarks/kernels/benchmark_swap_blocks.py):
The execution time of a single kernel increased by about 60%.
Avg. GPU->CPU time taken for swapping blocks: 0.016023850440979003 seconds
Avg. GPU->File time taken for swapping blocks: 0.026778271198272707 seconds
Avg. File->GPU time taken for swapping blocks: 0.025919318199157715 seconds
Online server benchmark:
Token throughput dropped only slightly, about 2%.
2.1 Swap to CPU (997 swaps):
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --tensor-parallel-size 4 --swap-space 40
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000
2.2 Swap to File (996 swaps):
Server:
python3 -m vllm.entrypoints.openai.api_server --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --num-gpu-blocks-override 270 --preemption-mode swap --swap-space 0 --external-swapper file:///root/test --external-swapper-space 40 --tensor-parallel-size 4
Client:
python3 benchmark_serving.py --backend vllm --dataset-name random --model /root/Llama-2-7b-hf/ --tokenizer /root/Llama-2-7b-hf/ --random-input-len 1024 --random-output-len 1024 --request-rate 100 --num-prompts 1000