[Model] Initialize support for Deepseek-VL2 models #11578

Isotr0py · 2024-12-28T05:20:14Z

FIX #11236

Initialize support for deepseek-vl2 series models
Note that deepseek-ai/deepseek-vl2-tiny is not supported yet because it doesn't use MLA attention.

Signed-off-by: Isotr0py <[email protected]>

github-actions · 2024-12-28T05:20:25Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Signed-off-by: Isotr0py <[email protected]>

mergify · 2024-12-28T16:56:54Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Isotr0py <[email protected]>

iamweiliu · 2025-01-14T10:12:26Z

File "/data/venvs/lib/vllm/vllm/model_executor/models/registry.py", line 384, in inspect_model_cls
for arch in architectures:
TypeError: 'NoneType' object is not iterable

The config.json in Deepseek-VL2's model repos are all missing the architectures field, so you need to specify --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' or add architectures": ["DeepseekVLV2ForCausalLM"] to the config file manually.

Save my life!

iamweiliu · 2025-01-14T10:15:58Z

ERROR 01-14 18:14:14 engine.py:387] AttributeError: 'DeepseekVLV2Config' object has no attribute 'hidden_size'

Isotr0py · 2025-01-14T10:32:04Z

@iamweiliu Can you provide the full logs? hidden_size should not be got from DeepseekVLV2Config because it doesn't have this field.

iamweiliu · 2025-01-14T14:59:45Z

@iamweiliu Can you provide the full logs? hidden_size should not be got from DeepseekVLV2Config because it doesn't have this field.

I already fix it. Just install https://github.com/Isotr0py/DeepSeek-VL2.

* [Bugfix][V1] Fix molmo text-only inputs (vllm-project#11676) Signed-off-by: Jee Jee Li <[email protected]> * [Kernel] Move attn_type to Attention.__init__() (vllm-project#11690) Signed-off-by: Chen Zhang <[email protected]> * [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (vllm-project#11685) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> * [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (vllm-project#11772) Signed-off-by: DarkLight1337 <[email protected]> * [Model] Future-proof Qwen2-Audio multi-modal processor (vllm-project#11776) Signed-off-by: DarkLight1337 <[email protected]> * [XPU] Make pp group initilized for pipeline-parallelism (vllm-project#11648) Signed-off-by: yisheng <[email protected]> * [Doc][3/N] Reorganize Serving section (vllm-project#11766) Signed-off-by: DarkLight1337 <[email protected]> * [Kernel][LoRA]Punica prefill kernels fusion (vllm-project#11234) Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Abatom <[email protected]> Co-authored-by: Zhonghua Deng <[email protected]> * [Bugfix] Update attention interface in `Whisper` (vllm-project#11784) Signed-off-by: Roger Wang <[email protected]> * [CI] Fix neuron CI and run offline tests (vllm-project#11779) Signed-off-by: Liangfu Chen <[email protected]> * fix init error for MessageQueue when n_local_reader is zero (vllm-project#11768) * [Doc] Create a vulnerability management team (vllm-project#9925) Signed-off-by: Russell Bryant <[email protected]> * [CI][CPU] adding build number to docker image name (vllm-project#11788) Signed-off-by: Yuan Zhou <[email protected]> * [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (vllm-project#11798) Signed-off-by: Roger Wang <[email protected]> * [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (vllm-project#11800) Signed-off-by: DarkLight1337 <[email protected]> * [doc] add doc to explain how to use uv (vllm-project#11773) Signed-off-by: youkaichao <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] Support audio language models on V1 (vllm-project#11733) Signed-off-by: Roger Wang <[email protected]> * [doc] update how pip can install nightly wheels (vllm-project#11806) Signed-off-by: youkaichao <[email protected]> * [Doc] Add note to `gte-Qwen2` models (vllm-project#11808) Signed-off-by: DarkLight1337 <[email protected]> * [optimization] remove python function call for custom op (vllm-project#11750) Signed-off-by: youkaichao <[email protected]> * [Bugfix] update the prefix for qwen2 (vllm-project#11795) Co-authored-by: jiadi.jjd <[email protected]> * [Doc]Add documentation for using EAGLE in vLLM (vllm-project#11417) Signed-off-by: Sourashis Roy <[email protected]> * [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (vllm-project#11794) * [Doc] Group examples into categories (vllm-project#11782) Signed-off-by: Harry Mellor <[email protected]> * [Bugfix] Fix image input for Pixtral-HF (vllm-project#11741) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] sort torch profiler table by kernel timing (vllm-project#11813) * Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (vllm-project#11824) * Fixed docker build for ppc64le (vllm-project#11518) Signed-off-by: Nishidha Panpaliya <[email protected]> * [OpenVINO] Fixed Docker.openvino build (vllm-project#11732) Signed-off-by: Ilya Lavrenov <[email protected]> * [Bugfix] Add checks for LoRA and CPU offload (vllm-project#11810) Signed-off-by: Jee Jee Li <[email protected]> * [Docs] reorganize sponsorship page (vllm-project#11639) Signed-off-by: simon-mo <[email protected]> * [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used (vllm-project#11825) Signed-off-by: DarkLight1337 <[email protected]> * [misc] improve memory profiling (vllm-project#11809) Signed-off-by: youkaichao <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [doc] update wheels url (vllm-project#11830) Signed-off-by: youkaichao <[email protected]> * [Docs] Update sponsor name: 'Novita' to 'Novita AI' (vllm-project#11833) * [Hardware][Apple] Native support for macOS Apple Silicon (vllm-project#11696) Signed-off-by: Wallas Santos <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [torch.compile] consider relevant code in compilation cache (vllm-project#11614) Signed-off-by: youkaichao <[email protected]> * [VLM] Reorganize profiling/processing-related code (vllm-project#11812) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Move examples into categories (vllm-project#11840) Signed-off-by: Harry Mellor <[email protected]> * [Doc][4/N] Reorganize API Reference (vllm-project#11843) Signed-off-by: DarkLight1337 <[email protected]> * [CI/Build][Bugfix] Fix CPU CI image clean up (vllm-project#11836) Signed-off-by: jiang1.li <[email protected]> * [Bugfix][XPU] fix silu_and_mul (vllm-project#11823) Signed-off-by: yan ma <[email protected]> * [Misc] Move some model utils into vision file (vllm-project#11848) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Expand Multimodal API Reference (vllm-project#11852) Signed-off-by: DarkLight1337 <[email protected]> * [Misc]add some explanations for BlockHashType (vllm-project#11847) * [TPU][Quantization] TPU `W8A8` (vllm-project#11785) Co-authored-by: Woosuk Kwon <[email protected]> * [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (vllm-project#11698) Signed-off-by: Randall Smith <[email protected]> * [Docs] Add Google Cloud Meetup (vllm-project#11864) * [CI] Turn on basic correctness tests for V1 (vllm-project#10864) * treat do_lower_case in the same way as the sentence-transformers library (vllm-project#11815) Signed-off-by: Max de Bayser <[email protected]> * [Doc] Recommend uv and python 3.12 for quickstart guide (vllm-project#11849) Signed-off-by: mgoin <[email protected]> * [Misc] Move `print_*_once` from utils to logger (vllm-project#11298) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]> Co-authored-by: Maxime Fournioux <[email protected]> * [Doc] Intended links Python multiprocessing library (vllm-project#11878) * [perf]fix current stream (vllm-project#11870) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Override dunder methods of placeholder modules (vllm-project#11882) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] fix beam search input errors and latency benchmark script (vllm-project#11875) Signed-off-by: Ye Qi <[email protected]> Co-authored-by: yeq <[email protected]> * [Doc] Add model development API Reference (vllm-project#11884) Signed-off-by: DarkLight1337 <[email protected]> * [platform] Allow platform specify attention backend (vllm-project#11609) Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> * [ci]try to fix flaky multi-step tests (vllm-project#11894) Signed-off-by: youkaichao <[email protected]> * [Misc] Provide correct Pixtral-HF chat template (vllm-project#11891) Signed-off-by: DarkLight1337 <[email protected]> * [Docs] Add Modal to deployment frameworks (vllm-project#11907) * [Doc][5/N] Move Community and API Reference to the bottom (vllm-project#11896) Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: Simon Mo <[email protected]> * [VLM] Enable tokenized inputs for merged multi-modal processor (vllm-project#11900) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Show default pooling method in a table (vllm-project#11904) Signed-off-by: DarkLight1337 <[email protected]> * [torch.compile] Hide KV cache behind torch.compile boundary (vllm-project#11677) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix] Validate lora adapters to avoid crashing server (vllm-project#11727) Signed-off-by: Joe Runde <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [BUGFIX] Fix `UnspecifiedPlatform` package name (vllm-project#11916) Signed-off-by: Kunshang Ji <[email protected]> * [ci] fix gh200 tests (vllm-project#11919) Signed-off-by: youkaichao <[email protected]> * [misc] remove python function call for custom activation op (vllm-project#11885) Co-authored-by: youkaichao <[email protected]> * [platform] support pytorch custom op pluggable (vllm-project#11328) Signed-off-by: wangxiyuan <[email protected]> * Replace "online inference" with "online serving" (vllm-project#11923) Signed-off-by: Harry Mellor <[email protected]> * [ci] Fix sampler tests (vllm-project#11922) Signed-off-by: youkaichao <[email protected]> * [Doc] [1/N] Initial guide for merged multi-modal processor (vllm-project#11925) Signed-off-by: DarkLight1337 <[email protected]> * [platform] support custom torch.compile backend key (vllm-project#11318) Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> * [Doc] Rename offline inference examples (vllm-project#11927) Signed-off-by: Harry Mellor <[email protected]> * [Docs] Fix docstring in `get_ip` function (vllm-project#11932) Signed-off-by: Kuntai Du <[email protected]> * Doc fix in `benchmark_long_document_qa_throughput.py` (vllm-project#11933) Signed-off-by: Kuntai Du <[email protected]> * [Hardware][CPU] Support MOE models on x86 CPU (vllm-project#11831) Signed-off-by: jiang1.li <[email protected]> * [Misc] Clean up debug code in Deepseek-V3 (vllm-project#11930) Signed-off-by: Isotr0py <[email protected]> * [Misc] Update benchmark_prefix_caching.py fixed example usage (vllm-project#11920) Signed-off-by: Ren MinMin <[email protected]> Co-authored-by: Ren MinMin <[email protected]> * [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (vllm-project#11939) Signed-off-by: Travis Johnson <[email protected]> * [mypy] Fix mypy warnings in api_server.py (vllm-project#11941) Signed-off-by: Fred Reiss <[email protected]> * [ci] fix broken distributed-tests-4-gpus (vllm-project#11937) Signed-off-by: youkaichao <[email protected]> * [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (vllm-project#11672) Signed-off-by: Sungjae Lee <[email protected]> * [Bugfix] fused_experts_impl wrong compute type for float32 (vllm-project#11921) Signed-off-by: shaochangxu.scx <[email protected]> Co-authored-by: shaochangxu.scx <[email protected]> * [CI/Build] Move model-specific multi-modal processing tests (vllm-project#11934) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Basic guide for writing unit tests for new models (vllm-project#11951) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix RobertaModel loading (vllm-project#11940) Signed-off-by: NickLucche <[email protected]> * [Model] Add cogagent model support vLLM (vllm-project#11742) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [V1] Avoid sending text prompt to core engine (vllm-project#11963) Signed-off-by: Roger Wang <[email protected]> * [CI/Build] Add markdown linter (vllm-project#11857) Signed-off-by: Rafael Vasquez <[email protected]> * [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100) Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764) * [V1][Core][1/n] Logging and Metrics (vllm-project#11962) Signed-off-by: [email protected] <[email protected]> * [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973) Signed-off-by: [email protected] <[email protected]> * [MISC] fix typo in kv transfer send recv test (vllm-project#11983) * [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979) * [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972) Signed-off-by: Sungjae Lee <[email protected]> * [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947) Signed-off-by: Yida Wu <[email protected]> * [Misc]Minor Changes about Worker (vllm-project#11555) Signed-off-by: Chenguang Li <[email protected]> * [platform] add ray_device_key (vllm-project#11948) Signed-off-by: youkaichao <[email protected]> * Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980) Signed-off-by: Alex-Brooks <[email protected]> * [Kernel] unified_attention for Attention.forward (vllm-project#11967) Signed-off-by: Chen Zhang <[email protected]> * [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Doc] Organise installation documentation into categories and tabs (vllm-project#11935) Signed-off-by: Harry Mellor <[email protected]> * [platform] add device_control env var (vllm-project#12009) Signed-off-by: youkaichao <[email protected]> * [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516) Signed-off-by: Shanshan Shen <[email protected]> * bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982) Signed-off-by: elijah <[email protected]> * Using list * Revert "[misc] improve memory profiling (vllm-project#11809)" This reverts commit 889e662. * Trying to make scales work with compileable attention * Docs lint --------- Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: yisheng <[email protected]> Signed-off-by: Abatom <[email protected]> Signed-off-by: Liangfu Chen <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Yuan Zhou <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Sourashis Roy <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Ilya Lavrenov <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: Wallas Santos <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: yan ma <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]> Signed-off-by: Ye Qi <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Kunshang Ji <[email protected]> Signed-off-by: Kuntai Du <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Ren MinMin <[email protected]> Signed-off-by: Travis Johnson <[email protected]> Signed-off-by: Fred Reiss <[email protected]> Signed-off-by: Sungjae Lee <[email protected]> Signed-off-by: shaochangxu.scx <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Rafael Vasquez <[email protected]> Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Chenguang Li <[email protected]> Signed-off-by: Alex-Brooks <[email protected]> Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: elijah <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Co-authored-by: YiSheng5 <[email protected]> Co-authored-by: Zhonghua Deng <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Yuan <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: jiangjiadi <[email protected]> Co-authored-by: jiadi.jjd <[email protected]> Co-authored-by: sroy745 <[email protected]> Co-authored-by: Jie Fu (傅杰) <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: WangErXiao <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Wallas Henrique <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Yan Ma <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Maxime Fournioux <[email protected]> Co-authored-by: Guspan Tanadi <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: yeq <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Co-authored-by: Charles Frye <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: cennn <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: minmin <[email protected]> Co-authored-by: Ren MinMin <[email protected]> Co-authored-by: Travis Johnson <[email protected]> Co-authored-by: Fred Reiss <[email protected]> Co-authored-by: Sungjae Lee <[email protected]> Co-authored-by: shaochangxu <[email protected]> Co-authored-by: shaochangxu.scx <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: sixgod <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: Akshat Tripathi <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Avshalom Manevich <[email protected]> Co-authored-by: Yangcheng Li <[email protected]> Co-authored-by: Siyuan Li <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Chenguang Li <[email protected]> Co-authored-by: Alex Brooks <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Co-authored-by: elijah <[email protected]>

* [Misc] Move weights mapper (vllm-project#11443) Signed-off-by: Jee Jee Li <[email protected]> * [Bugfix] Fix issues in CPU build Dockerfile. Fixes vllm-project#9182 (vllm-project#11435) Signed-off-by: Yuan Tang <[email protected]> * [Model] Automatic conversion of classification and reward models (vllm-project#11469) Signed-off-by: DarkLight1337 <[email protected]> * [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (vllm-project#11472) * [Misc] Update disaggregation benchmark scripts and test logs (vllm-project#11456) Signed-off-by: Jiaxin Shan <[email protected]> * [Frontend] Enable decord to load video from base64 (vllm-project#11492) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Improve GitHub links (vllm-project#11491) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Move some multimodal utils to modality-specific modules (vllm-project#11494) Signed-off-by: DarkLight1337 <[email protected]> * Mypy checking for vllm/compilation (vllm-project#11496) Signed-off-by: lucast2021 <[email protected]> Co-authored-by: lucast2021 <[email protected]> * [Misc][LoRA] Fix LoRA weight mapper (vllm-project#11495) Signed-off-by: Jee Jee Li <[email protected]> * [Doc] Add `QVQ` and `QwQ` to the list of supported models (vllm-project#11509) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler (vllm-project#10681) Signed-off-by: Sourashis Roy <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> * [Model] Modify MolmoForCausalLM MLP (vllm-project#11510) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Add placeholder module (vllm-project#11501) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Add video example to openai client for multimodal (vllm-project#11521) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [1/N] API Server (Remove Proxy) (vllm-project#11529) * [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (vllm-project#11523) Signed-off-by: mgoin <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: simon-mo <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: HandH1998 <[email protected]> * [2/N] API Server: Avoid ulimit footgun (vllm-project#11530) * Deepseek v3 (vllm-project#11502) Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: robertgshaw2-neuralmagic <[email protected]> * [Docs] Document Deepseek V3 support (vllm-project#11535) Signed-off-by: simon-mo <[email protected]> * Update openai_compatible_server.md (vllm-project#11536) Co-authored-by: Simon Mo <[email protected]> * [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (vllm-project#11394) Signed-off-by: Woosuk Kwon <[email protected]> * [V1] Fix yapf (vllm-project#11538) Signed-off-by: Woosuk Kwon <[email protected]> * [CI] Fix broken CI (vllm-project#11543) * [misc] fix typing (vllm-project#11540) Signed-off-by: youkaichao <[email protected]> * [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (vllm-project#11534) * [BugFix] Fix quantization for all other methods (vllm-project#11547) * [Platform] Move model arch check to platform (vllm-project#11503) Signed-off-by: Mengqing Cao <[email protected]> * Update deploying_with_k8s.md with AMD ROCm GPU example (vllm-project#11465) Signed-off-by: Alex He <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Bugfix] Fix TeleChat2ForCausalLM weights mapper (vllm-project#11546) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Abstract the logic for reading and writing media content (vllm-project#11527) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Add xgrammar in doc (vllm-project#11549) Signed-off-by: ccjincong <[email protected]> * [VLM] Support caching in merged multi-modal processor (vllm-project#11396) Signed-off-by: DarkLight1337 <[email protected]> * [MODEL] LoRA support for Jamba model (vllm-project#11209) Signed-off-by: Erez Schwartz <[email protected]> * [Misc]Add BNB quantization for MolmoForCausalLM (vllm-project#11551) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix (vllm-project#11566) Signed-off-by: Isotr0py <[email protected]> * [Bugfix] Fix for ROCM compressed tensor support (vllm-project#11561) * [Doc] Update mllama example based on official doc (vllm-project#11567) Signed-off-by: Chen Zhang <[email protected]> * [V1] [4/N] API Server: ZMQ/MP Utilities (vllm-project#11541) * [Bugfix] Last token measurement fix (vllm-project#11376) Signed-off-by: rajveerb <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Model] Support InternLM2 Reward models (vllm-project#11571) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Model] Remove hardcoded image tokens ids from Pixtral (vllm-project#11582) Signed-off-by: Roger Wang <[email protected]> * [Hardware][AMD]: Replace HIPCC version with more precise ROCm version (vllm-project#11515) Signed-off-by: hjwei <[email protected]> * [V1][Minor] Set pin_memory=False for token_ids_cpu tensor (vllm-project#11581) Signed-off-by: Woosuk Kwon <[email protected]> * [Doc] Minor documentation fixes (vllm-project#11580) Signed-off-by: DarkLight1337 <[email protected]> * [bugfix] interleaving sliding window for cohere2 model (vllm-project#11583) Signed-off-by: youkaichao <[email protected]> * [V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input (vllm-project#11545) Signed-off-by: [email protected] <[email protected]> * [Doc] Convert list tables to MyST (vllm-project#11594) Signed-off-by: DarkLight1337 <[email protected]> * [v1][bugfix] fix cudagraph with inplace buffer assignment (vllm-project#11596) Signed-off-by: youkaichao <[email protected]> * [Misc] KV cache transfer connector registry (vllm-project#11481) Signed-off-by: KuntaiDu <[email protected]> * Remove print statement in DeepseekScalingRotaryEmbedding (vllm-project#11604) * [v1] fix compilation cache (vllm-project#11598) Signed-off-by: youkaichao <[email protected]> * [Docker] bump up neuron sdk v2.21 (vllm-project#11593) Signed-off-by: Liangfu Chen <[email protected]> * [Build][Kernel] Update CUTLASS to v3.6.0 (vllm-project#11607) Signed-off-by: Tyler Michael Smith <[email protected]> * [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (vllm-project#11618) Signed-off-by: jiang1.li <[email protected]> * [platforms] enable platform plugins (vllm-project#11602) Signed-off-by: youkaichao <[email protected]> * [VLM] Abstract out multi-modal data parsing in merged processor (vllm-project#11620) Signed-off-by: DarkLight1337 <[email protected]> * [V1] [6/N] API Server: Better Shutdown (vllm-project#11586) * [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel (vllm-project#11631) * [benchmark] Remove dependency for H100 benchmark step (vllm-project#11572) * [Model][LoRA]LoRA support added for MolmoForCausalLM (vllm-project#11439) Signed-off-by: Matthias Vogler <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Matthias Vogler <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [Bugfix] Fix OpenAI parallel sampling when using xgrammar (vllm-project#11637) Signed-off-by: mgoin <[email protected]> * [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (vllm-project#6909) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (vllm-project#11565) * [V1] Simpify vision block hash for prefix caching by removing offset from hash (vllm-project#11646) * [V1][VLM] V1 support for selected single-image models. (vllm-project#11632) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Benchmark] Add benchmark script for CPU offloading (vllm-project#11533) Signed-off-by: ApostaC <[email protected]> Co-authored-by: KuntaiDu <[email protected]> * [Bugfix][Refactor] Unify model management in frontend (vllm-project#11660) Signed-off-by: Joe Runde <[email protected]> * [VLM] Add max-count checking in data parser for single image models (vllm-project#11661) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Misc] Optimize Qwen2-VL LoRA test (vllm-project#11663) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Replace space with - in the file names (vllm-project#11667) Signed-off-by: Lu Fang <[email protected]> * [Doc] Fix typo (vllm-project#11666) Signed-off-by: Kazuhiro Serizawa <[email protected]> * [V1] Implement Cascade Attention (vllm-project#11635) Signed-off-by: Woosuk Kwon <[email protected]> * [VLM] Move supported limits and max tokens to merged multi-modal processor (vllm-project#11669) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (vllm-project#11674) Signed-off-by: DarkLight1337 <[email protected]> * [mypy] Pass type checking in vllm/inputs (vllm-project#11680) Signed-off-by: Tobias Pitters <[email protected]> * [VLM] Merged multi-modal processor for LLaVA-NeXT (vllm-project#11682) Signed-off-by: DarkLight1337 <[email protected]> * According to vllm.EngineArgs, the name should be distributed_executor_backend (vllm-project#11689) * [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. (vllm-project#10013) Signed-off-by: Kathy Yu <[email protected]> * [V1][Minor] Optimize token_ids_cpu copy (vllm-project#11692) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix] Change kv scaling factor by param json on nvidia gpu (vllm-project#11688) Signed-off-by: bjmsong <[email protected]> Co-authored-by: bjmsong <[email protected]> * Resolve race conditions in Marlin kernel (vllm-project#11493) Signed-off-by: wchen61 <[email protected]> * [Misc] Minimum requirements for SageMaker compatibility (vllm-project#11576) * Update default max_num_batch_tokens for chunked prefill (vllm-project#11694) * [Bugfix] Check chain_speculative_sampling before calling it (vllm-project#11673) Signed-off-by: Lu Fang <[email protected]> * [perf-benchmark] Fix dependency for steps in benchmark pipeline (vllm-project#11710) * [Model] Whisper model implementation (vllm-project#11280) Co-authored-by: Aurick Qiao <[email protected]> * [V1] Simplify Shutdown (vllm-project#11659) * [Bugfix] Fix ColumnParallelLinearWithLoRA slice (vllm-project#11708) Signed-off-by: ZincCat <[email protected]> * [V1] Improve TP>1 Error Handling + Stack Trace (vllm-project#11721) Co-authored-by: Tyler Michael Smith <[email protected]> * [Misc]Add BNB quantization for Qwen2VL (vllm-project#11719) Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * Update requirements-tpu.txt to support python 3.9 and 3.11 (vllm-project#11695) Signed-off-by: mgoin <[email protected]> * [V1] Chore: cruft removal (vllm-project#11724) * [V1] log GPU blocks num for MultiprocExecutor (vllm-project#11656) * Update tool_calling.md (vllm-project#11701) * Update bnb.md with example for OpenAI (vllm-project#11718) * [V1] Add `RayExecutor` support for `AsyncLLM` (api server) (vllm-project#11712) * [V1] Add kv cache utils tests. (vllm-project#11513) Signed-off-by: xcnick <[email protected]> * [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (vllm-project#11233) Signed-off-by: Yan Burman <[email protected]> Signed-off-by: Ido Asraff <[email protected]> * [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision (vllm-project#11717) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix precision error in LLaVA-NeXT (vllm-project#11735) Signed-off-by: DarkLight1337 <[email protected]> * [Model] Remove unnecessary weight initialization logic (vllm-project#11736) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Bugfix][V1] Fix test_kv_cache_utils.py (vllm-project#11738) Signed-off-by: Jee Jee Li <[email protected]> * [MISC] Replace c10::optional with std::optional (vllm-project#11730) Signed-off-by: Lu Fang <[email protected]> * [distributed] remove pynccl's redundant stream (vllm-project#11744) * fix: [doc] fix typo (vllm-project#11751) Co-authored-by: Lancer <[email protected]> * [Frontend] Improve `StreamingResponse` Exception Handling (vllm-project#11752) * [distributed] remove pynccl's redundant change_state (vllm-project#11749) * [Doc] [1/N] Reorganize Getting Started section (vllm-project#11645) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Remove block size constraint (vllm-project#11723) * [V1] Add BlockTable class (vllm-project#11693) Signed-off-by: Woosuk Kwon <[email protected]> * [Misc] Fix typo for valid_tool_parses (vllm-project#11753) Signed-off-by: Rui Qiao <[email protected]> * [V1] Refactor get_executor_cls (vllm-project#11754) * [mypy] Forward pass function type hints in lora (vllm-project#11740) Signed-off-by: lucast2021 <[email protected]> Co-authored-by: lucast2021 <[email protected]> * k8s-config: Update the secret to use stringData (vllm-project#11679) Signed-off-by: Suraj Deshmukh <[email protected]> * [VLM] Separate out profiling-related logic (vllm-project#11746) Signed-off-by: DarkLight1337 <[email protected]> * [Doc][2/N] Reorganize Models and Usage sections (vllm-project#11755) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix max image size for LLaVA-Onevision (vllm-project#11769) Signed-off-by: Roger Wang <[email protected]> * [doc] explain how to add interleaving sliding window support (vllm-project#11771) Signed-off-by: youkaichao <[email protected]> * [Bugfix][V1] Fix molmo text-only inputs (vllm-project#11676) Signed-off-by: Jee Jee Li <[email protected]> * [Kernel] Move attn_type to Attention.__init__() (vllm-project#11690) Signed-off-by: Chen Zhang <[email protected]> * format * [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (vllm-project#11685) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> * deepseek overflow fix (#349) * [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (vllm-project#11772) Signed-off-by: DarkLight1337 <[email protected]> * [Model] Future-proof Qwen2-Audio multi-modal processor (vllm-project#11776) Signed-off-by: DarkLight1337 <[email protected]> * [XPU] Make pp group initilized for pipeline-parallelism (vllm-project#11648) Signed-off-by: yisheng <[email protected]> * [Doc][3/N] Reorganize Serving section (vllm-project#11766) Signed-off-by: DarkLight1337 <[email protected]> * [Kernel][LoRA]Punica prefill kernels fusion (vllm-project#11234) Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Abatom <[email protected]> Co-authored-by: Zhonghua Deng <[email protected]> * [Bugfix] Update attention interface in `Whisper` (vllm-project#11784) Signed-off-by: Roger Wang <[email protected]> * [CI] Fix neuron CI and run offline tests (vllm-project#11779) Signed-off-by: Liangfu Chen <[email protected]> * fix init error for MessageQueue when n_local_reader is zero (vllm-project#11768) * [Doc] Create a vulnerability management team (vllm-project#9925) Signed-off-by: Russell Bryant <[email protected]> * [CI][CPU] adding build number to docker image name (vllm-project#11788) Signed-off-by: Yuan Zhou <[email protected]> * [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (vllm-project#11798) Signed-off-by: Roger Wang <[email protected]> * [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (vllm-project#11800) Signed-off-by: DarkLight1337 <[email protected]> * [doc] add doc to explain how to use uv (vllm-project#11773) Signed-off-by: youkaichao <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] Support audio language models on V1 (vllm-project#11733) Signed-off-by: Roger Wang <[email protected]> * [doc] update how pip can install nightly wheels (vllm-project#11806) Signed-off-by: youkaichao <[email protected]> * [Doc] Add note to `gte-Qwen2` models (vllm-project#11808) Signed-off-by: DarkLight1337 <[email protected]> * [optimization] remove python function call for custom op (vllm-project#11750) Signed-off-by: youkaichao <[email protected]> * [Bugfix] update the prefix for qwen2 (vllm-project#11795) Co-authored-by: jiadi.jjd <[email protected]> * [Doc]Add documentation for using EAGLE in vLLM (vllm-project#11417) Signed-off-by: Sourashis Roy <[email protected]> * [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (vllm-project#11794) * [Doc] Group examples into categories (vllm-project#11782) Signed-off-by: Harry Mellor <[email protected]> * [Bugfix] Fix image input for Pixtral-HF (vllm-project#11741) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] sort torch profiler table by kernel timing (vllm-project#11813) * Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (vllm-project#11824) * Fixed docker build for ppc64le (vllm-project#11518) Signed-off-by: Nishidha Panpaliya <[email protected]> * [OpenVINO] Fixed Docker.openvino build (vllm-project#11732) Signed-off-by: Ilya Lavrenov <[email protected]> * [Bugfix] Add checks for LoRA and CPU offload (vllm-project#11810) Signed-off-by: Jee Jee Li <[email protected]> * [Docs] reorganize sponsorship page (vllm-project#11639) Signed-off-by: simon-mo <[email protected]> * [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used (vllm-project#11825) Signed-off-by: DarkLight1337 <[email protected]> * [misc] improve memory profiling (vllm-project#11809) Signed-off-by: youkaichao <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [doc] update wheels url (vllm-project#11830) Signed-off-by: youkaichao <[email protected]> * [Docs] Update sponsor name: 'Novita' to 'Novita AI' (vllm-project#11833) * [Hardware][Apple] Native support for macOS Apple Silicon (vllm-project#11696) Signed-off-by: Wallas Santos <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [torch.compile] consider relevant code in compilation cache (vllm-project#11614) Signed-off-by: youkaichao <[email protected]> * [VLM] Reorganize profiling/processing-related code (vllm-project#11812) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Move examples into categories (vllm-project#11840) Signed-off-by: Harry Mellor <[email protected]> * [Doc][4/N] Reorganize API Reference (vllm-project#11843) Signed-off-by: DarkLight1337 <[email protected]> * [CI/Build][Bugfix] Fix CPU CI image clean up (vllm-project#11836) Signed-off-by: jiang1.li <[email protected]> * [Bugfix][XPU] fix silu_and_mul (vllm-project#11823) Signed-off-by: yan ma <[email protected]> * [Misc] Move some model utils into vision file (vllm-project#11848) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Expand Multimodal API Reference (vllm-project#11852) Signed-off-by: DarkLight1337 <[email protected]> * [Misc]add some explanations for BlockHashType (vllm-project#11847) * [TPU][Quantization] TPU `W8A8` (vllm-project#11785) Co-authored-by: Woosuk Kwon <[email protected]> * [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (vllm-project#11698) Signed-off-by: Randall Smith <[email protected]> * [Docs] Add Google Cloud Meetup (vllm-project#11864) * Revert nccl changes (#351) * Revert "[distributed] remove pynccl's redundant change_state (vllm-project#11749)" This reverts commit 9e764e7. * Revert "[distributed] remove pynccl's redundant stream (vllm-project#11744)" This reverts commit 635b897. * [CI] Turn on basic correctness tests for V1 (vllm-project#10864) * treat do_lower_case in the same way as the sentence-transformers library (vllm-project#11815) Signed-off-by: Max de Bayser <[email protected]> * [Doc] Recommend uv and python 3.12 for quickstart guide (vllm-project#11849) Signed-off-by: mgoin <[email protected]> * [Misc] Move `print_*_once` from utils to logger (vllm-project#11298) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]> Co-authored-by: Maxime Fournioux <[email protected]> * [Doc] Intended links Python multiprocessing library (vllm-project#11878) * [perf]fix current stream (vllm-project#11870) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Override dunder methods of placeholder modules (vllm-project#11882) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] fix beam search input errors and latency benchmark script (vllm-project#11875) Signed-off-by: Ye Qi <[email protected]> Co-authored-by: yeq <[email protected]> * [Doc] Add model development API Reference (vllm-project#11884) Signed-off-by: DarkLight1337 <[email protected]> * [platform] Allow platform specify attention backend (vllm-project#11609) Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> * [ci]try to fix flaky multi-step tests (vllm-project#11894) Signed-off-by: youkaichao <[email protected]> * [Misc] Provide correct Pixtral-HF chat template (vllm-project#11891) Signed-off-by: DarkLight1337 <[email protected]> * fp8 support (#352) Co-authored-by: Yida Wu <[email protected]> * [Docs] Add Modal to deployment frameworks (vllm-project#11907) * [Doc][5/N] Move Community and API Reference to the bottom (vllm-project#11896) Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: Simon Mo <[email protected]> * [VLM] Enable tokenized inputs for merged multi-modal processor (vllm-project#11900) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Show default pooling method in a table (vllm-project#11904) Signed-off-by: DarkLight1337 <[email protected]> * [torch.compile] Hide KV cache behind torch.compile boundary (vllm-project#11677) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix] Validate lora adapters to avoid crashing server (vllm-project#11727) Signed-off-by: Joe Runde <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> * [BUGFIX] Fix `UnspecifiedPlatform` package name (vllm-project#11916) Signed-off-by: Kunshang Ji <[email protected]> * [ci] fix gh200 tests (vllm-project#11919) Signed-off-by: youkaichao <[email protected]> * [misc] remove python function call for custom activation op (vllm-project#11885) Co-authored-by: youkaichao <[email protected]> * [platform] support pytorch custom op pluggable (vllm-project#11328) Signed-off-by: wangxiyuan <[email protected]> * Replace "online inference" with "online serving" (vllm-project#11923) Signed-off-by: Harry Mellor <[email protected]> * [ci] Fix sampler tests (vllm-project#11922) Signed-off-by: youkaichao <[email protected]> * [Doc] [1/N] Initial guide for merged multi-modal processor (vllm-project#11925) Signed-off-by: DarkLight1337 <[email protected]> * [platform] support custom torch.compile backend key (vllm-project#11318) Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> * [Doc] Rename offline inference examples (vllm-project#11927) Signed-off-by: Harry Mellor <[email protected]> * [Docs] Fix docstring in `get_ip` function (vllm-project#11932) Signed-off-by: Kuntai Du <[email protected]> * Doc fix in `benchmark_long_document_qa_throughput.py` (vllm-project#11933) Signed-off-by: Kuntai Du <[email protected]> * [Hardware][CPU] Support MOE models on x86 CPU (vllm-project#11831) Signed-off-by: jiang1.li <[email protected]> * [Misc] Clean up debug code in Deepseek-V3 (vllm-project#11930) Signed-off-by: Isotr0py <[email protected]> * [Misc] Update benchmark_prefix_caching.py fixed example usage (vllm-project#11920) Signed-off-by: Ren MinMin <[email protected]> Co-authored-by: Ren MinMin <[email protected]> * [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (vllm-project#11939) Signed-off-by: Travis Johnson <[email protected]> * [mypy] Fix mypy warnings in api_server.py (vllm-project#11941) Signed-off-by: Fred Reiss <[email protected]> * [ci] fix broken distributed-tests-4-gpus (vllm-project#11937) Signed-off-by: youkaichao <[email protected]> * [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (vllm-project#11672) Signed-off-by: Sungjae Lee <[email protected]> * [Bugfix] fused_experts_impl wrong compute type for float32 (vllm-project#11921) Signed-off-by: shaochangxu.scx <[email protected]> Co-authored-by: shaochangxu.scx <[email protected]> * [CI/Build] Move model-specific multi-modal processing tests (vllm-project#11934) Signed-off-by: DarkLight1337 <[email protected]> * [Doc] Basic guide for writing unit tests for new models (vllm-project#11951) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix RobertaModel loading (vllm-project#11940) Signed-off-by: NickLucche <[email protected]> * [Model] Add cogagent model support vLLM (vllm-project#11742) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [V1] Avoid sending text prompt to core engine (vllm-project#11963) Signed-off-by: Roger Wang <[email protected]> * [CI/Build] Add markdown linter (vllm-project#11857) Signed-off-by: Rafael Vasquez <[email protected]> * [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100) Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764) * [V1][Core][1/n] Logging and Metrics (vllm-project#11962) Signed-off-by: [email protected] <[email protected]> * [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973) Signed-off-by: [email protected] <[email protected]> * [MISC] fix typo in kv transfer send recv test (vllm-project#11983) * [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979) * [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972) Signed-off-by: Sungjae Lee <[email protected]> * [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947) Signed-off-by: Yida Wu <[email protected]> * [Misc]Minor Changes about Worker (vllm-project#11555) Signed-off-by: Chenguang Li <[email protected]> * [platform] add ray_device_key (vllm-project#11948) Signed-off-by: youkaichao <[email protected]> * Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980) Signed-off-by: Alex-Brooks <[email protected]> * [Kernel] unified_attention for Attention.forward (vllm-project#11967) Signed-off-by: Chen Zhang <[email protected]> * [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Doc] Organise installation documentation into categories and tabs (vllm-project#11935) Signed-off-by: Harry Mellor <[email protected]> * [platform] add device_control env var (vllm-project#12009) Signed-off-by: youkaichao <[email protected]> * [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516) Signed-off-by: Shanshan Shen <[email protected]> * bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982) Signed-off-by: elijah <[email protected]> * Using list * Revert "[misc] improve memory profiling (vllm-project#11809)" This reverts commit 889e662. * Multi-lingual P3L (#356) * Commiting the *multilingual* P3L test. * Created a *multi-lingual* P3L test. * Making ruff happy. * . * Added a reference to the language-scripture Confluence table. * Typo fixing. * Harmonizing naming. * Fixing comments in the header. --------- Co-authored-by: Alexei V. Ivanov <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> * Trying to make scales work with compileable attention * Docs lint * linter formatting bug fixes * inherit config file updates under fused_moe from main branch. * match tests for the MOE layers with main. --------- Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Jiaxin Shan <[email protected]> Signed-off-by: lucast2021 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Sourashis Roy <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: Alex He <[email protected]> Signed-off-by: ccjincong <[email protected]> Signed-off-by: Erez Schwartz <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: rajveerb <[email protected]> Signed-off-by: hjwei <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: KuntaiDu <[email protected]> Signed-off-by: Liangfu Chen <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: Matthias Vogler <[email protected]> Signed-off-by: ApostaC <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Kazuhiro Serizawa <[email protected]> Signed-off-by: Tobias Pitters <[email protected]> Signed-off-by: Kathy Yu <[email protected]> Signed-off-by: bjmsong <[email protected]> Signed-off-by: wchen61 <[email protected]> Signed-off-by: ZincCat <[email protected]> Signed-off-by: xcnick <[email protected]> Signed-off-by: Yan Burman <[email protected]> Signed-off-by: Ido Asraff <[email protected]> Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: Suraj Deshmukh <[email protected]> Signed-off-by: yisheng <[email protected]> Signed-off-by: Abatom <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Yuan Zhou <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Ilya Lavrenov <[email protected]> Signed-off-by: Wallas Santos <[email protected]> Signed-off-by: yan ma <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]> Signed-off-by: Ye Qi <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Kunshang Ji <[email protected]> Signed-off-by: Kuntai Du <[email protected]> Signed-off-by: Ren MinMin <[email protected]> Signed-off-by: Travis Johnson <[email protected]> Signed-off-by: Fred Reiss <[email protected]> Signed-off-by: Sungjae Lee <[email protected]> Signed-off-by: shaochangxu.scx <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Rafael Vasquez <[email protected]> Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Chenguang Li <[email protected]> Signed-off-by: Alex-Brooks <[email protected]> Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: elijah <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Rui Qiao <[email protected]> Co-authored-by: Jiaxin Shan <[email protected]> Co-authored-by: Lucas Tucker <[email protected]> Co-authored-by: lucast2021 <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: sroy745 <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: HandH1998 <[email protected]> Co-authored-by: robertgshaw2-neuralmagic <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Co-authored-by: AlexHe99 <[email protected]> Co-authored-by: Chen1022 <[email protected]> Co-authored-by: ErezSC42 <[email protected]> Co-authored-by: Selali <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Rajveer Bachkaniwala <[email protected]> Co-authored-by: hj-wei <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: whyiug <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Matthias Vogler <[email protected]> Co-authored-by: Matthias Vogler <[email protected]> Co-authored-by: John Giorgi <[email protected]> Co-authored-by: sakunkun <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Yihua Cheng <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Kazuhiro Serizawa <[email protected]> Co-authored-by: Tobias Pitters <[email protected]> Co-authored-by: Chunyang Wen <[email protected]> Co-authored-by: Kathy Yu <[email protected]> Co-authored-by: bjmsong <[email protected]> Co-authored-by: bjmsong <[email protected]> Co-authored-by: wchen61 <[email protected]> Co-authored-by: Nathan Azrak <[email protected]> Co-authored-by: Sachin Varghese <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: ZincCat <[email protected]> Co-authored-by: WangErXiao <[email protected]> Co-authored-by: Hust_YangXian <[email protected]> Co-authored-by: Alberto Ferrer <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: xcnick <[email protected]> Co-authored-by: Yan Burman <[email protected]> Co-authored-by: cennn <[email protected]> Co-authored-by: Lancer <[email protected]> Co-authored-by: Lancer <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Suraj Deshmukh <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: YiSheng5 <[email protected]> Co-authored-by: Zhonghua Deng <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Yuan <[email protected]> Co-authored-by: jiangjiadi <[email protected]> Co-authored-by: jiadi.jjd <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Jie Fu (傅杰) <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Wallas Henrique <[email protected]> Co-authored-by: Yan Ma <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Maxime Fournioux <[email protected]> Co-authored-by: Guspan Tanadi <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: yeq <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: Yida Wu <[email protected]> Co-authored-by: Charles Frye <[email protected]> Co-authored-by: minmin <[email protected]> Co-authored-by: Ren MinMin <[email protected]> Co-authored-by: Travis Johnson <[email protected]> Co-authored-by: Fred Reiss <[email protected]> Co-authored-by: Sungjae Lee <[email protected]> Co-authored-by: shaochangxu <[email protected]> Co-authored-by: shaochangxu.scx <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: sixgod <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: Akshat Tripathi <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Avshalom Manevich <[email protected]> Co-authored-by: Yangcheng Li <[email protected]> Co-authored-by: Siyuan Li <[email protected]> Co-authored-by: Chenguang Li <[email protected]> Co-authored-by: Alex Brooks <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Co-authored-by: elijah <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Alexei V. Ivanov <[email protected]> Co-authored-by: vllmellm <[email protected]>

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: hzh <[email protected]>

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Bowen Wang <[email protected]>

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Isotr0py <[email protected]>

@t-parry

…ntion (#399) * [V1] Avoid sending text prompt to core engine (vllm-project#11963) Signed-off-by: Roger Wang <[email protected]> * [CI/Build] Add markdown linter (vllm-project#11857) Signed-off-by: Rafael Vasquez <[email protected]> * [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100) Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764) * [V1][Core][1/n] Logging and Metrics (vllm-project#11962) Signed-off-by: [email protected] <[email protected]> * [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973) Signed-off-by: [email protected] <[email protected]> * [MISC] fix typo in kv transfer send recv test (vllm-project#11983) * [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979) * [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972) Signed-off-by: Sungjae Lee <[email protected]> * [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947) Signed-off-by: Yida Wu <[email protected]> * [Misc]Minor Changes about Worker (vllm-project#11555) Signed-off-by: Chenguang Li <[email protected]> * [platform] add ray_device_key (vllm-project#11948) Signed-off-by: youkaichao <[email protected]> * Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980) Signed-off-by: Alex-Brooks <[email protected]> * [Kernel] unified_attention for Attention.forward (vllm-project#11967) Signed-off-by: Chen Zhang <[email protected]> * [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Doc] Organise installation documentation into categories and tabs (vllm-project#11935) Signed-off-by: Harry Mellor <[email protected]> * [platform] add device_control env var (vllm-project#12009) Signed-off-by: youkaichao <[email protected]> * [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516) Signed-off-by: Shanshan Shen <[email protected]> * bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982) Signed-off-by: elijah <[email protected]> * [Doc] Fix build from source and installation link in README.md (vllm-project#12013) Signed-off-by: Yikun <[email protected]> * Using list * [Bugfix] Fix deepseekv3 gate bias error (vllm-project#12002) Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]> * Revert "[misc] improve memory profiling (vllm-project#11809)" This reverts commit 889e662. * Multi-lingual P3L (#356) * Commiting the *multilingual* P3L test. * Created a *multi-lingual* P3L test. * Making ruff happy. * . * Added a reference to the language-scripture Confluence table. * Typo fixing. * Harmonizing naming. * Fixing comments in the header. --------- Co-authored-by: Alexei V. Ivanov <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> * Trying to make scales work with compileable attention * [Docs] Add Sky Computing Lab to project intro (vllm-project#12019) Signed-off-by: Woosuk Kwon <[email protected]> * [HPU][Bugfix] set_forward_context and CI test execution (vllm-project#12014) Signed-off-by: Konrad Zawora <[email protected]> * [Doc] Update Quantization Hardware Support Documentation (vllm-project#12025) Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> * [HPU][misc] add comments for explanation (vllm-project#12034) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix various bugs in multi-modal processor (vllm-project#12031) Signed-off-by: DarkLight1337 <[email protected]> * [Kernel] Revert the API change of Attention.forward (vllm-project#12038) Signed-off-by: Chen Zhang <[email protected]> * [Platform] Add output for Attention Backend (vllm-project#11981) Signed-off-by: wangxiyuan <[email protected]> * [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (vllm-project#12040) Signed-off-by: Chen Zhang <[email protected]> * Explain where the engine args go when using Docker (vllm-project#12041) Signed-off-by: Harry Mellor <[email protected]> * Docs lint * [Doc]: Update the Json Example of the `Engine Arguments` document (vllm-project#12045) * [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (vllm-project#11924) Signed-off-by: Jee Jee Li <[email protected]> * [Kernel] Support MulAndSilu (vllm-project#11624) Signed-off-by: Jee Jee Li <[email protected]> * [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (vllm-project#12046) Signed-off-by: Konrad Zawora <[email protected]> * [Platform] move current_memory_usage() into platform (vllm-project#11369) Signed-off-by: Shanshan Shen <[email protected]> * [V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065) Signed-off-by: Woosuk Kwon <[email protected]> * [Misc] Add multipstep chunked-prefill support for FlashInfer (vllm-project#10467) * [core] Turn off GPU communication overlap for Ray executor (vllm-project#12051) Signed-off-by: Rui Qiao <[email protected]> * [core] platform agnostic executor via collective_rpc (vllm-project#11256) Signed-off-by: youkaichao <[email protected]> * [Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-project#12062) Signed-off-by: Kyle Sayers <[email protected]> * [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (vllm-project#12003) * Fix: cases with empty sparsity config (vllm-project#12057) Signed-off-by: Rahul Tuli <[email protected]> * Type-fix: make execute_model output type optional (vllm-project#12020) * [Platform] Do not raise error if _Backend is not found (vllm-project#12023) Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> * [Model]: Support internlm3 (vllm-project#12037) * Misc: allow to use proxy in `HTTPConnection` (vllm-project#12042) Signed-off-by: Yuan Zhou <[email protected]> * [Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765) Signed-off-by: kewang-xlnx <[email protected]> Signed-off-by: kewang2 <[email protected]> Co-authored-by: kewang2 <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [Doc]: Update `OpenAI-Compatible Server` documents (vllm-project#12082) * [Bugfix] use right truncation for non-generative tasks (vllm-project#12050) Signed-off-by: Joe Runde <[email protected]> * [V1][Core] Autotune encoder cache budget (vllm-project#11895) Signed-off-by: Roger Wang <[email protected]> * [Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Allow hip sources to be directly included when compiling for rocm. (vllm-project#12087) * [Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651) Signed-off-by: mgoin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: mgoin <[email protected]> * [Doc] Add documentation for specifying model architecture (vllm-project#12105) * Various cosmetic/comment fixes (vllm-project#12089) Signed-off-by: mgoin <[email protected]> * [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (vllm-project#12067) Signed-off-by: Isotr0py <[email protected]> * Support torchrun and SPMD-style offline inference (vllm-project#12071) Signed-off-by: youkaichao <[email protected]> * [core] LLM.collective_rpc interface and RLHF example (vllm-project#12084) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix max image feature size for Llava-one-vision (vllm-project#12104) Signed-off-by: Roger Wang <[email protected]> * Enable user marker for vllm profiling (#357) * Enable user marker for vllm profiling --------- Co-authored-by: Gregory Shtrasberg <[email protected]> * [misc] Add LoRA kernel micro benchmarks (vllm-project#11579) * [Model] Add support for deepseek-vl2-tiny model (vllm-project#12068) Signed-off-by: Isotr0py <[email protected]> * Deepseek V3 support (#364) * Changing the hard coded datatype to see if it's enough for the model to work * Picking the upstrteam moe kernel version * make upstream fix for v3 also works for rocm v2 * Conditional fnuz dtype * Requantizing from fn to fnuz * Requantizing moe as well * Actually requantizing moe weights * Conditional requantization and assert on padding in block quant * Format --------- Co-authored-by: charlifu <[email protected]> * [Bugfix] Set enforce_eager automatically for mllama (vllm-project#12127) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix] Fix a path bug in disaggregated prefill example script. (vllm-project#12121) Signed-off-by: Kuntai Du <[email protected]> * [CI]add genai-perf benchmark in nightly benchmark (vllm-project#10704) Signed-off-by: Kunshang Ji <[email protected]> * [Doc] Add instructions on using Podman when SELinux is active (vllm-project#12136) Signed-off-by: Yuan Tang <[email protected]> * [Bugfix] Fix issues in CPU build Dockerfile (vllm-project#12135) Signed-off-by: Yuan Tang <[email protected]> * [BugFix] add more `is not None` check in VllmConfig.__post_init__ (vllm-project#12138) Signed-off-by: Chen Zhang <[email protected]> * [Misc] Add deepseek_vl2 chat template (vllm-project#12143) Signed-off-by: Isotr0py <[email protected]> * [ROCm][MoE] moe tuning support for rocm (vllm-project#12049) Signed-off-by: Divakar Verma <[email protected]> * [V1] Move more control of kv cache initialization from model_executor to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]> * [Misc][LoRA] Improve the readability of LoRA error messages (vllm-project#12102) Signed-off-by: Jee Jee Li <[email protected]> * [CI/Build][CPU][Bugfix] Fix CPU CI (vllm-project#12150) Signed-off-by: jiang1.li <[email protected]> * [core] allow callable in collective_rpc (vllm-project#12151) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix score api for missing max_model_len validation (vllm-project#12119) Signed-off-by: Wallas Santos <[email protected]> * [Bugfix] Mistral tokenizer encode accept list of str (vllm-project#12149) Signed-off-by: Kunshang Ji <[email protected]> * [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (vllm-project#12134) Signed-off-by: Gregory Shtrasberg <[email protected]> * [torch.compile] disable logging when cache is disabled (vllm-project#12043) Signed-off-by: youkaichao <[email protected]> * [misc] fix cross-node TP (vllm-project#12166) Signed-off-by: youkaichao <[email protected]> * [AMD][CI/Build][Bugfix] use pytorch stale wheel (vllm-project#12172) Signed-off-by: hongxyan <[email protected]> * [core] further polish memory profiling (vllm-project#12126) Signed-off-by: youkaichao <[email protected]> * [Docs] Fix broken link in SECURITY.md (vllm-project#12175) Signed-off-by: Russell Bryant <[email protected]> * [Model] Port deepseek-vl2 processor, remove dependency (vllm-project#12169) Signed-off-by: Isotr0py <[email protected]> * [core] clean up executor class hierarchy between v1 and v0 (vllm-project#12171) Signed-off-by: youkaichao <[email protected]> * [Misc] Support register quantization method out-of-tree (vllm-project#11969) * [V1] Collect env var for usage stats (vllm-project#12115) * [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (vllm-project#12152) Signed-off-by: Michal Adamczyk <[email protected]> * [Bugfix] Fix multi-modal processors for transformers 4.48 (vllm-project#12187) * [torch.compile] store inductor compiled Python file (vllm-project#12182) Signed-off-by: youkaichao <[email protected]> * benchmark_serving support --served-model-name param (vllm-project#12109) Signed-off-by: zibai <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Misc] Add BNB support to GLM4-V model (vllm-project#12184) Signed-off-by: Isotr0py <[email protected]> * [V1] Add V1 support of Qwen2-VL (vllm-project#12128) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: imkero <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> * [Model] Support for fairseq2 Llama (vllm-project#11442) Signed-off-by: Martin Gleize <[email protected]> Co-authored-by: mgleize user <[email protected]> * [Bugfix] Fix num_heads value for simple connector when tp enabled (vllm-project#12074) Signed-off-by: Shangming Cai <[email protected]> * [torch.compile] fix sym_tensor_indices (vllm-project#12191) Signed-off-by: youkaichao <[email protected]> * Move linting to `pre-commit` (vllm-project#11975) Signed-off-by: Harry Mellor <[email protected]> * [DOC] Fix typo in docstring and assert message (vllm-project#12194) Signed-off-by: Yuan Tang <[email protected]> * [DOC] Add missing docstring in LLMEngine.add_request() (vllm-project#12195) Signed-off-by: Yuan Tang <[email protected]> * [Bugfix] Fix incorrect types in LayerwiseProfileResults (vllm-project#12196) Signed-off-by: Yuan Tang <[email protected]> * [Model] Add Qwen2 PRM model support (vllm-project#12202) Signed-off-by: Isotr0py <[email protected]> * [Core] Interface for accessing model from `VllmRunner` (vllm-project#10353) Signed-off-by: DarkLight1337 <[email protected]> * [misc] add placeholder format.sh (vllm-project#12206) Signed-off-by: youkaichao <[email protected]> * [CI/Build] Remove dummy CI steps (vllm-project#12208) Signed-off-by: DarkLight1337 <[email protected]> * [CI/Build] Make pre-commit faster (vllm-project#12212) Signed-off-by: DarkLight1337 <[email protected]> * [Model] Upgrade Aria to transformers 4.48 (vllm-project#12203) Signed-off-by: DarkLight1337 <[email protected]> * [misc] print a message to suggest how to bypass commit hooks (vllm-project#12217) Signed-off-by: youkaichao <[email protected]> * [core][bugfix] configure env var during import vllm (vllm-project#12209) Signed-off-by: youkaichao <[email protected]> * [V1] Remove `_get_cache_block_size` (vllm-project#12214) Signed-off-by: Chen Zhang <[email protected]> * [Misc] Pass `attention` to impl backend (vllm-project#12218) Signed-off-by: wangxiyuan <[email protected]> * [Bugfix] Fix `HfExampleModels.find_hf_info` (vllm-project#12223) Signed-off-by: DarkLight1337 <[email protected]> * [CI] Pass local python version explicitly to pre-commit mypy.sh (vllm-project#12224) Signed-off-by: Chen Zhang <[email protected]> * Using ROCm6.3.1 base docker and building hipblas-common (#366) * [Misc] Update CODEOWNERS (vllm-project#12229) * fix: update platform detection for M-series arm based MacBook processors (vllm-project#12227) Signed-off-by: isikhi <[email protected]> * [misc] add cuda runtime version to usage data (vllm-project#12190) Signed-off-by: youkaichao <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [bugfix] catch xgrammar unsupported array constraints (vllm-project#12210) Signed-off-by: Jason Cheng <[email protected]> * [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (vllm-project#12222) Signed-off-by: Jinzhen Lin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * Add quantization and guided decoding CODEOWNERS (vllm-project#12228) Signed-off-by: mgoin <[email protected]> * [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (vllm-project#11777) Signed-off-by: Gregory Shtrasberg <[email protected]> * [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (vllm-project#12230) Signed-off-by: NickLucche <[email protected]> * [ci/build] disable failed and flaky tests (vllm-project#12240) Signed-off-by: youkaichao <[email protected]> * [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (vllm-project#12244) Signed-off-by: DarkLight1337 <[email protected]> * [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (vllm-project#12237) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Remove redundant TypeVar from base model (vllm-project#12248) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-project#12252) Signed-off-by: DarkLight1337 <[email protected]> * [torch.compile] transparent compilation with more logging (vllm-project#12246) Signed-off-by: youkaichao <[email protected]> * [V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm-project#12259) Signed-off-by: Roger Wang <[email protected]> * Remove pytorch comments for outlines + compressed-tensors (vllm-project#12260) Signed-off-by: Thomas Parnell <[email protected]> * [Platform] improve platforms getattr (vllm-project#12264) Signed-off-by: Mengqing Cao <[email protected]> * [ci/build] update nightly torch for gh200 test (vllm-project#12270) Signed-off-by: youkaichao <[email protected]> * [Bugfix] fix race condition that leads to wrong order of token returned (vllm-project#10802) Signed-off-by: Jannis Schönleber <[email protected]> * [Kernel] fix moe_align_block_size error condition (vllm-project#12239) Signed-off-by: Jinzhen Lin <[email protected]> * [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (vllm-project#10907) Signed-off-by: rickyx <[email protected]> * [Bugfix] Multi-sequence broken (vllm-project#11898) Signed-off-by: Andy Lo <[email protected]> * [Misc] Remove experimental dep from tracing.py (vllm-project#12007) Signed-off-by: Adrian Cole <[email protected]> * [Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-project#12235) Signed-off-by: wangxiyuan <[email protected]> * [Core] Free CPU pinned memory on environment cleanup (vllm-project#10477) * Update pre-commit.yml (#374) * Update pre-commit.yml * Reapplying missing format * New codespell exclude location --------- Co-authored-by: Kevin H. Luu <[email protected]> * [bugfix] moe tuning. rm is_navi() (vllm-project#12273) Signed-off-by: Divakar Verma <[email protected]> * [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (vllm-project#12277) Signed-off-by: maleksan85 <[email protected]> Co-authored-by: maleksan85 <[email protected]> * [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (vllm-project#12281) Signed-off-by: Hongxia Yang <[email protected]> * [VLM] Simplify post-processing of replacement info (vllm-project#12269) Signed-off-by: DarkLight1337 <[email protected]> * [ci/lint] Add back default arg for pre-commit (vllm-project#12279) Signed-off-by: kevin <[email protected]> * [CI] add docker volume prune to neuron CI (vllm-project#12291) Signed-off-by: Liangfu Chen <[email protected]> * [Ci/Build] Fix mypy errors on main (vllm-project#12296) Signed-off-by: DarkLight1337 <[email protected]> * [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (vllm-project#12288) Signed-off-by: Nick Hill <[email protected]> * [core] separate builder init and builder prepare for each batch (vllm-project#12253) Signed-off-by: youkaichao <[email protected]> * [Build] update requirements of no-device (vllm-project#12299) Signed-off-by: Mengqing Cao <[email protected]> * [Core] Support fully transparent sleep mode (vllm-project#11743) Signed-off-by: youkaichao <[email protected]> * [VLM] Avoid unnecessary tokenization (vllm-project#12310) Signed-off-by: DarkLight1337 <[email protected]> * [Model][Bugfix]: correct Aria model output (vllm-project#12309) Signed-off-by: xffxff <[email protected]> * [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (vllm-project#12313) Signed-off-by: Roger Wang <[email protected]> * [Doc] Add docs for prompt replacement (vllm-project#12318) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Fix the error in the tip for the --lora-modules parameter (vllm-project#12319) Signed-off-by: wangerxiao <[email protected]> * [Misc] Improve the readability of BNB error messages (vllm-project#12320) Signed-off-by: Jee Jee Li <[email protected]> * Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367) * switching detokenize flag to be False * detokenize = False for benchmarks * restoring default in main vllm code for detokenize * removing extra spaces * moving detokenize to flag * adding support for token ids --------- Co-authored-by: maleksan85 <[email protected]> * [Bugfix] Fix HPU multiprocessing executor (vllm-project#12167) Signed-off-by: Konrad Zawora <[email protected]> * [Core] Support `reset_prefix_cache` (vllm-project#12284) * [Frontend][V1] Online serving performance improvements (vllm-project#12287) * [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (vllm-project#12282) Signed-off-by: Randall Smith <[email protected]> * FP8 FA fixes (#381) * FP8 FA fixes Summary: Add missing clamp and fix reciprocal scale computation. * linter * Returning the use of the proper stream in allreduce (#382) * [Bugfix] Fixing AMD LoRA CI test. (vllm-project#12329) Signed-off-by: Alexei V. Ivanov <[email protected]> * [Docs] Update FP8 KV Cache documentation (vllm-project#12238) Signed-off-by: mgoin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Docs] Document vulnerability disclosure process (vllm-project#12326) Signed-off-by: Russell Bryant <[email protected]> * [V1] Add `uncache_blocks` (vllm-project#12333) * [doc] explain common errors around torch.compile (vllm-project#12340) Signed-off-by: youkaichao <[email protected]> * [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (vllm-project#12338) Signed-off-by: zhenwei <[email protected]> * [Bugfix] Fix k_proj's bias for whisper self attention (vllm-project#12342) Signed-off-by: Isotr0py <[email protected]> * [Kernel] Flash Attention 3 Support (vllm-project#12093) Signed-off-by: Lucas Wilkinson <[email protected]> * [Doc] Troubleshooting errors during model inspection (vllm-project#12351) Signed-off-by: DarkLight1337 <[email protected]> * [V1] Simplify M-RoPE (vllm-project#12352) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: imkero <[email protected]> * [Bugfix] Fix broken internvl2 inference with v1 (vllm-project#12360) Signed-off-by: Isotr0py <[email protected]> * [core] add wake_up doc and some sanity check (vllm-project#12361) Signed-off-by: youkaichao <[email protected]> * [torch.compile] decouple compile sizes and cudagraph sizes (vllm-project#12243) Signed-off-by: youkaichao <[email protected]> * [FP8][Kernel] Dynamic kv cache scaling factors computation (vllm-project#11906) Signed-off-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Micah Williamson <[email protected]> * [TPU] Update TPU CI to use torchxla nightly on 20250122 (vllm-project#12334) Signed-off-by: Siyuan Liu <[email protected]> * [Docs] Document Phi-4 support (vllm-project#12362) Signed-off-by: Isotr0py <[email protected]> * [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (vllm-project#11528) Signed-off-by: ElizaWszola <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (vllm-project#12357) Signed-off-by: Junichi Sato <[email protected]> * [Docs] Add meetup slides (vllm-project#12345) Signed-off-by: Woosuk Kwon <[email protected]> * Using pytorch commit past the point when rowwise PR (pytorch/pytorch#144432) was merged (#384) * [Docs] Update spec decode + structured output in compat matrix (vllm-project#12373) Signed-off-by: Russell Bryant <[email protected]> * [V1][Frontend] Coalesce bunched `RequestOutput`s (vllm-project#12298) Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Robert Shaw <[email protected]> * Set weights_only=True when using torch.load() (vllm-project#12366) Signed-off-by: Russell Bryant <[email protected]> * [Bugfix] Path join when building local path for S3 clone (vllm-project#12353) Signed-off-by: Omer Dayan (SW-GPU) <[email protected]> * Update compressed-tensors version (vllm-project#12367) * [V1] Increase default batch size for H100/H200 (vllm-project#12369) Signed-off-by: Woosuk Kwon <[email protected]> * [perf] fix perf regression from vllm-project#12253 (vllm-project#12380) Signed-off-by: youkaichao <[email protected]> * [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389) Signed-off-by: Roger Wang <[email protected]> * [ci/build] fix wheel size check (vllm-project#12396) Signed-off-by: youkaichao <[email protected]> * [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382) * [ci/build] sync default value for wheel size (vllm-project#12398) Signed-off-by: youkaichao <[email protected]> * [Misc] Enable proxy support in benchmark script (vllm-project#12356) Signed-off-by: Junichi Sato <[email protected]> * [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375) Signed-off-by: Lucas Wilkinson <[email protected]> * Applying scales rename to fp8 config (#387) * [Misc] Remove deprecated code (vllm-project#12383) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405) Signed-off-by: Lucas Wilkinson <[email protected]> * Dev-docker Documentation Updates (#378) * Dev-docker Documentation Updates Minor updates to several sections, with links to other documents where appropriate. * Fix formatting of GEMM filename * README cleanup - Reorder some sections of the README to make them easier to follow - Improve formatting of bash commands - Prefer use of huggingface model names instead of hard-coded directories - Clean up wording * Expanded sample commands for Latency and Throughput * Fix markdown links * Fix pre-commit errors * Updates from review Initial updates to incorporate feedback from a review session held with @t-parry * Update script args to match current recommendations * Remove recommended max-num-seqs values for now --------- Co-authored-by: Gregory Shtrasberg <[email protected]> * [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413) * [Bugfix] Fix BLIP-2 processing (vllm-project#12412) Signed-off-by: DarkLight1337 <[email protected]> * [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408) Signed-off-by: Divakar Verma <[email protected]> * [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355) Signed-off-by: Isotr0py <[email protected]> * [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422) Signed-off-by: Siyuan Liu <[email protected]> * [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094) Signed-off-by: Keyun Tong <[email protected]> * [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439) Signed-off-by: Roger Wang <[email protected]> * [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445) * [Frontend] generation_config.json for maximum tokens(vllm-project#12242) Signed-off-by: Matthew Hendrey <[email protected]> Signed-off-by: Shangming Cai <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Co-authored-by: shangmingc <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: wangxiyuan <[email protected]> * [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417) Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: mgoin <[email protected]> * [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450) * [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434) Signed-off-by: Lucas Wilkinson <[email protected]> * [Build/CI] Fix libcuda.so linkage (vllm-project#12424) Signed-off-by: Tyler Michael Smith <[email protected]> * [Frontend] Rerank API (Jina- and Cohere-compatible API) (vllm-project#12376) Signed-off-by: Kyle Mistele <[email protected]> * [DOC] Add link to vLLM blog (vllm-project#12460) Signed-off-by: Yuan Tang <[email protected]> * [V1] Avoid list creation in input preparation (vllm-project#12457) Signed-off-by: Woosuk Kwon <[email protected]> * [Frontend] Support scores endpoint in run_batch (vllm-project#12430) Signed-off-by: Pooya Davoodi <[email protected]> * [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464) Signed-off-by: Isotr0py <[email protected]> * [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467) Signed-off-by: Isotr0py <[email protected]> * [Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-project#12339) Signed-off-by: Lucas Wilkinson <[email protected]> * [V1][Metrics] Add initial Prometheus logger (vllm-project#12416) Signed-off-by: Mark McLoughlin <[email protected]> * [V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#12469) Signed-off-by: Woosuk Kwon <[email protected]> * [FlashInfer] Upgrade to 0.2.0 (vllm-project#11194) Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> * Support FP8 FA from Quark format (#388) * Support FP8 FA from Quark format * Support FP8 FA from Quark format * nit: update comment * Direct call on ROCm * 20250127 docs update (#392) * updating code blocks * typo * updated manifest * Including feedback * whitespace * Deepseek instructions * hyperlink fix * hyperlink fix * updating what is new * cpx update * typo * whitespace * whitespace * Faster Custom Paged Attention kernels (#372) * integrate new cpa kernel, update tests and benchmark * added comments to mfma4 kernel * further comments for mfma16 kernel * clang-format * Lint * add flag for logits rtz conversion and disable by default * lint * [Bugfix]: Fix paged attention unit tests of #372 (#389) * [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`,`csrc/attention/paged_attention_v2.cu` and `csrc/rocm/attention.cu`. * improve code documentation. * lint --------- Co-authored-by: vllmellm <[email protected]> --------- Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Joe Shajrawi <[email protected]> Co-authored-by: TJian <[email protected]> Co-authored-by: vllmellm <[email protected]> * Using a more precise profiling on ROCm to properly account for weights padding (#394) * Update Dockerfile.rocm * [Bugfix]: inclucde the env variables required for running FastSyncLLM Signed-off-by: vllmellm <[email protected]> * fix pre-commit lint Signed-off-by: vllmellm <[email protected]> --------- Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Rafael Vasquez <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Sungjae Lee <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Chenguang Li <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Alex-Brooks <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: elijah <[email protected]> Signed-off-by: Yikun <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Konrad Zawora <[email protected]> Signed-off-by: tjtanaa <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: yisheng <[email protected]> Signed-off-by: Abatom <[email protected]> Signed-off-by: Liangfu Chen <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Yuan Zhou <[email protected]> Signed-off-by: Sourashis Roy <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Ilya Lavrenov <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: Wallas Santos <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: yan ma <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]> Signed-off-by: Ye Qi <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Kunshang Ji <[email protected]> Signed-off-by: Kuntai Du <[email protected]> Signed-off-by: Ren MinMin <[email protected]> Signed-off-by: Travis Johnson <[email protected]> Signed-off-by: Fred Reiss <[email protected]> Signed-off-by: shaochangxu.scx <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: Rahul Tuli <[email protected]> Signed-off-by: kewang-xlnx <[email protected]> Signed-off-by: kewang2 <[email protected]> Signed-off-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: Divakar Verma <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: hongxyan <[email protected]> Signed-off-by: Michal Adamczyk <[email protected]> Signed-off-by: zibai <[email protected]> Signed-off-by: Martin Gleize <[email protected]> Signed-off-by: Shangming Cai <[email protected]> Signed-off-by: isikhi <[email protected]> Signed-off-by: Jason Cheng <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]> Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Jannis Schönleber <[email protected]> Signed-off-by: rickyx <[email protected]> Signed-off-by: Andy Lo <[email protected]> Signed-off-by: Adrian Cole <[email protected]> Signed-off-by: maleksan85 <[email protected]> Signed-off-by: Hongxia Yang <[email protected]> Signed-off-by: kevin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: xffxff <[email protected]> Signed-off-by: wangerxiao <[email protected]> Signed-off-by: Alexei V. Ivanov <[email protected]> Signed-off-by: zhenwei <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Junichi Sato <[email protected]> Signed-off-by: Omer Dayan (SW-GPU) <[email protected]> Signed-off-by: Keyun Tong <[email protected]> Signed-off-by: Matthew Hendrey <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Kyle Mistele <[email protected]> Signed-off-by: Pooya Davoodi <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: vllmellm <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Akshat Tripathi <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Avshalom Manevich <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Yangcheng Li <[email protected]> Co-authored-by: Siyuan Li <[email protected]> Co-authored-by: Sungjae Lee <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Chenguang Li <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Alex Brooks <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Co-authored-by: elijah <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Steve Luo <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Alexei V. Ivanov <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: maang-h <[email protected]> Co-authored-by: YiSheng5 <[email protected]> Co-authored-by: Zhonghua Deng <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Yuan <[email protected]> Co-authored-by: jiangjiadi <[email protected]> Co-authored-by: jiadi.jjd <[email protected]> Co-authored-by: sroy745 <[email protected]> Co-authored-by: Jie Fu (傅杰) <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: WangErXiao <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Wallas Henrique <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Yan Ma <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Maxime Fournioux <[email protected]> Co-authored-by: Guspan Tanadi <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: yeq <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Co-authored-by: Charles Frye <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: cennn <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: minmin <[email protected]> Co-authored-by: Ren MinMin <[email protected]> Co-authored-by: Travis Johnson <[email protected]> Co-authored-by: Fred Reiss <[email protected]> Co-authored-by: shaochangxu <[email protected]> Co-authored-by: shaochangxu.scx <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: sixgod <[email protected]> Co-authored-by: Elfie Guo <[email protected]> Co-authored-by: Rui Qiao <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: Rahul Tuli <[email protected]> Co-authored-by: Keyun Tong <[email protected]> Co-authored-by: RunningLeon <[email protected]> Co-authored-by: kewang-xlnx <[email protected]> Co-authored-by: kewang2 <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: tvirolai-amd <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Zhaoyi Li <[email protected]> Co-authored-by: charlifu <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: yancong <[email protected]> Co-authored-by: Michal Adamczyk <[email protected]> Co-authored-by: gujing <[email protected]> Co-authored-by: imkero <[email protected]> Co-authored-by: Martin Gleize <[email protected]> Co-authored-by: mgleize user <[email protected]> Co-authored-by: shangmingc <[email protected]> Co-authored-by: Işık <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cheng Kuan Yong Jason <[email protected]> Co-authored-by: Jinzhen Lin <[email protected]> Co-authored-by: Thomas Parnell <[email protected]> Co-authored-by: Jannis Schönleber <[email protected]> Co-authored-by: Ricky Xu <[email protected]> Co-authored-by: Andy Lo <[email protected]> Co-authored-by: Adrian Cole <[email protected]> Co-authored-by: Jani Monoses <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: maleksan85 <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: zhou fan <[email protected]> Co-authored-by: ilia-cher <[email protected]> Co-authored-by: liuzhenwei <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: Micah Williamson <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: Junichi Sato <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: omer-dayan <[email protected]> Co-authored-by: Mohit Deopujari <[email protected]> Co-authored-by: Jeremy Arnold <[email protected]> Co-authored-by: Matthew Hendrey <[email protected]> Co-authored-by: Kyle Mistele <[email protected]> Co-authored-by: Pooya Davoodi <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Bowen Wang <[email protected]> Co-authored-by: Bowen Bao <[email protected]> Co-authored-by: arakowsk-amd <[email protected]> Co-authored-by: sanyalington <[email protected]> Co-authored-by: Joe Shajrawi <[email protected]> Co-authored-by: vllmellm <[email protected]>

heibaidaolx123 · 2025-02-04T05:33:46Z

@Isotr0py
Hi, I use official image vllm-openai:v0.7.1 to serve deepseekvl2, the command:

docker run -d --rm --network=host --name=vllm-deepseekvl2 \
 --gpus all \
 -v $MODEL:/model \
 $IMAGE \
 --model /model \
 -tp 4 \
 --disable-log-stats \
 --port 23333 \
 --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' \
 --max-model-len 4096 \
 --gpu-memory-utilization 0.95 \
 --max-num-batched-tokens 8192  \
 --max-num-seqs 1 \
 --max-logprobs 1

error log:

(VllmWorkerProcess pid=1012) INFO 02-03 21:28:48 model_runner.py:1111] Starting to load model /model...
(VllmWorkerProcess pid=1014) INFO 02-03 21:28:48 model_runner.py:1111] Starting to load model /model...
(VllmWorkerProcess pid=1013) INFO 02-03 21:28:48 model_runner.py:1111] Starting to load model /model...
(VllmWorkerProcess pid=1013) INFO 02-03 21:28:49 config.py:2974] cudagraph sizes specified by model runner [1] is overridden by config [1]
(VllmWorkerProcess pid=1013) INFO 02-03 21:28:49 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
(VllmWorkerProcess pid=1012) INFO 02-03 21:28:49 config.py:2974] cudagraph sizes specified by model runner [1] is overridden by config [1]
(VllmWorkerProcess pid=1012) INFO 02-03 21:28:49 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
(VllmWorkerProcess pid=1014) INFO 02-03 21:28:49 config.py:2974] cudagraph sizes specified by model runner [1] is overridden by config [1]
(VllmWorkerProcess pid=1014) INFO 02-03 21:28:49 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 02-03 21:28:49 config.py:2974] cudagraph sizes specified by model runner [1] is overridden by config [1]
INFO 02-03 21:28:49 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240] Exception in worker VllmWorkerProcess while processing method load_model.
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 234, in _run_worker_process
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2208, in run_method
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 182, in load_model
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     self.model_runner.load_model()
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1113, in load_model
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     return loader.load_model(vllm_config=vllm_config)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 377, in load_model
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     model = _initialize_model(vllm_config=vllm_config)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 119, in _initialize_model
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_vl2.py", line 343, in __init__
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     self.language_model = init_vllm_registered_model(
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 258, in init_vllm_registered_model
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     return _initialize_model(vllm_config=vllm_config, prefix=prefix)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 119, in _initialize_model
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 660, in __init__
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     self.model = DeepseekV3Model(vllm_config=vllm_config,
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 594, in __init__
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     self.start_layer, self.end_layer, self.layers = make_layers(
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]                                                     ^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 556, in make_layers
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 596, in <lambda>
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     lambda prefix: DeepseekV3DecoderLayer(
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]                    ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 502, in __init__
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     self.self_attn = attn_cls(
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]                      ^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 422, in __init__
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     rope_scaling["rope_type"] = 'deepseek_yarn'
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240]     ~~~~~~~~~~~~^^^^^^^^^^^^^
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240] TypeError: 'NoneType' object does not support item assignment

Any idea?
Thanks.

Isotr0py · 2025-02-04T06:31:01Z

(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 422, in init
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240] rope_scaling["rope_type"] = 'deepseek_yarn'

Oh, Deepseek-V3's MLA attention hasn't supported normal RoPE without yarn, let me make a PR to fix it.

haonan-li · 2025-02-04T18:32:25Z

(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 422, in init
(VllmWorkerProcess pid=1012) ERROR 02-03 21:28:49 multiproc_worker_utils.py:240] rope_scaling["rope_type"] = 'deepseek_yarn'

Oh, Deepseek-V3's MLA attention hasn't supported normal RoPE without yarn, let me make a PR to fix it.

Any updates on this, I met the same issue.

heibaidaolx123 · 2025-02-05T08:18:36Z

@Isotr0py
I tried the latest wheel to serve deepseekvl2, the server startup completed. Thanks for your pr.
But deepseekvl2 has no chat template, so i cannot use chat api, and completion api doesn't support image input.
Is there a way to walk around?

Isotr0py · 2025-02-05T08:25:00Z

@heibaidaolx123 There is a jinja2 chat template for deepseekvl2 in example folder:

vllm/examples/template_deepseek_vl2.jinja

Lines 1 to 23 in 022bcc7

    
           {%- if messages[0]['role'] == 'system' -%} 
        
               {%- set system_message = messages[0]['content'] -%} 
        
               {%- set messages = messages[1:] -%} 
        
           {%- else -%} 
        
               {% set system_message = '' -%} 
        
           {%- endif -%} 
        
           {{ bos_token + system_message }} 
        
           {%- for message in messages -%} 
        
               {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%} 
        
                   {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }} 
        
               {%- endif -%} 
        
               {%- if message['role'] == 'user' -%} 
        
                   {{ '<|User|>: ' + message['content'] + '\n' }} 
        
               {%- elif message['role'] == 'assistant' -%} 
        
                   {{ '<|Assistant|>: ' + message['content'] + eos_token + '\n' }} 
        
               {%- endif -%} 
        
           {%- endfor -%} 
        
           {%- if add_generation_prompt -%} 
        
               {{ '<|Assistant|>: ' }} 
        
           {% endif %}

You can add --chat-template examples/template_deepseek_vl2.jinja when launching the server to use the template.
(Not sure if the path is correct in docker image though 😂 )

heibaidaolx123 · 2025-02-05T09:14:09Z

@heibaidaolx123 There is a jinja2 chat template for deepseekvl2 in example folder:

vllm/examples/template_deepseek_vl2.jinja

Lines 1 to 23 in 022bcc7

{%- if messages[0]['role'] == 'system' -%}

{%- set system_message = messages[0]['content'] -%}

{%- set messages = messages[1:] -%}

{%- else -%}

{% set system_message = '' -%}

{%- endif -%}

{{ bos_token + system_message }}

{%- for message in messages -%}

{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}

{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}

{%- endif -%}

{%- if message['role'] == 'user' -%}

{{ '<|User|>: ' + message['content'] + '\n' }}

{%- elif message['role'] == 'assistant' -%}

{{ '<|Assistant|>: ' + message['content'] + eos_token + '\n' }}

{%- endif -%}

{%- endfor -%}

{%- if add_generation_prompt -%}

{{ '<|Assistant|>: ' }}

{% endif %}

You can add --chat-template examples/template_deepseek_vl2.jinja when launching the server to use the template. (Not sure if the path is correct in docker image though 😂 )

@Isotr0py it works.
Thanks.

heibaidaolx123 · 2025-02-05T09:34:48Z

@Isotr0py I used 4*A40, using official image v0.7.1 and then reinstall vllm with command:

pip uninstall -y vllm && pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

command to start the app:

docker run -d --rm --network=host --name=vllm-deepseekvl2 \
 --gpus all \
 -v $MODEL:/model \
 $IMAGE \
 --model /model \
 -tp 4 \
 --disable-log-stats \
 --port 23333 \
 --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' \
 --chat-template /vllm-workspace/examples/template_deepseek_vl2.jinja \
 --max-model-len 4096 \
 --gpu-memory-utilization 0.95 \
 --max-num-batched-tokens 8192

the startup and first 100 requests seem normal.

error encountered:

INFO 02-05 01:20:00 engine.py:275] Added request chatcmpl-87294cc938af457f9628dd82be72189c.
ERROR 02-05 01:21:05 client.py:300] RuntimeError('Engine process (pid 76) died.')
ERROR 02-05 01:21:05 client.py:300] NoneType: None
CRITICAL 02-05 01:21:10 launcher.py:101] MQLLMEngine is already dead, terminating server process
INFO:     10.44.128.16:28304 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]

Isotr0py · 2025-02-05T10:28:12Z

@heibaidaolx123 Can you provide full traceback and inputs (prompt and image) so that I can reproduce the error?

heibaidaolx123 · 2025-02-05T12:10:39Z

@Isotr0py
The failure was not due to some paticular prompt and image.
I reran the test, app dies after dozens of requests.

the debug log:

INFO 02-05 03:53:55 engine.py:275] Added request chatcmpl-295569e43a3d4ae1bd0dc1440805690a.
DEBUG 02-05 03:54:02 client.py:170] Heartbeat successful.
DEBUG 02-05 03:54:05 client.py:191] Waiting for output from MQLLMEngine.
DEBUG 02-05 03:54:12 client.py:170] Heartbeat successful.
DEBUG 02-05 03:54:15 client.py:191] Waiting for output from MQLLMEngine.
DEBUG 02-05 03:54:22 client.py:170] Heartbeat successful.
DEBUG 02-05 03:54:25 client.py:191] Waiting for output from MQLLMEngine.
DEBUG 02-05 03:54:32 client.py:170] Heartbeat successful.
DEBUG 02-05 03:54:35 client.py:191] Waiting for output from MQLLMEngine.
DEBUG 02-05 03:54:42 client.py:170] Heartbeat successful.
DEBUG 02-05 03:54:45 client.py:191] Waiting for output from MQLLMEngine.
DEBUG 02-05 03:54:52 client.py:170] Heartbeat successful.
ERROR 02-05 03:54:52 client.py:300] RuntimeError('Engine process (pid 76) died.')
ERROR 02-05 03:54:52 client.py:300] NoneType: None
DEBUG 02-05 03:54:55 client.py:191] Waiting for output from MQLLMEngine.
CRITICAL 02-05 03:54:55 launcher.py:101] MQLLMEngine is already dead, terminating server process
INFO:     10.44.128.16:23162 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(VllmWorkerProcess pid=350) DEBUG 02-05 03:54:55 shm_broadcast.py:417] No available block found in 60 second. 
(VllmWorkerProcess pid=348) DEBUG 02-05 03:54:55 shm_broadcast.py:417] No available block found in 60 second. 
(VllmWorkerProcess pid=349) DEBUG 02-05 03:54:55 shm_broadcast.py:417] No available block found in 60 second. 
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]

When setting tp=2, everything works well, except that the quality of output text is not as good as expected.

Isotr0py · 2025-02-05T12:45:08Z

app dies after dozens of requests.

When setting tp=2, everything works well, except that the quality of output text is not as good as expected.

This is odd, especially the engine died and shutdown silently without any errors, can you open a new issue for this?

heibaidaolx123 · 2025-02-05T13:25:49Z

@Isotr0py will do.

@t-parry

* [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100) Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764) * [V1][Core][1/n] Logging and Metrics (vllm-project#11962) Signed-off-by: [email protected] <[email protected]> * [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685) Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973) Signed-off-by: [email protected] <[email protected]> * [MISC] fix typo in kv transfer send recv test (vllm-project#11983) * [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979) * [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972) Signed-off-by: Sungjae Lee <[email protected]> * [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947) Signed-off-by: Yida Wu <[email protected]> * [Misc]Minor Changes about Worker (vllm-project#11555) Signed-off-by: Chenguang Li <[email protected]> * [platform] add ray_device_key (vllm-project#11948) Signed-off-by: youkaichao <[email protected]> * Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980) Signed-off-by: Alex-Brooks <[email protected]> * [Kernel] unified_attention for Attention.forward (vllm-project#11967) Signed-off-by: Chen Zhang <[email protected]> * [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Doc] Organise installation documentation into categories and tabs (vllm-project#11935) Signed-off-by: Harry Mellor <[email protected]> * [platform] add device_control env var (vllm-project#12009) Signed-off-by: youkaichao <[email protected]> * [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516) Signed-off-by: Shanshan Shen <[email protected]> * bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982) Signed-off-by: elijah <[email protected]> * [Doc] Fix build from source and installation link in README.md (vllm-project#12013) Signed-off-by: Yikun <[email protected]> * Using list * [Bugfix] Fix deepseekv3 gate bias error (vllm-project#12002) Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]> * Revert "[misc] improve memory profiling (vllm-project#11809)" This reverts commit 889e662. * Multi-lingual P3L (#356) * Commiting the *multilingual* P3L test. * Created a *multi-lingual* P3L test. * Making ruff happy. * . * Added a reference to the language-scripture Confluence table. * Typo fixing. * Harmonizing naming. * Fixing comments in the header. --------- Co-authored-by: Alexei V. Ivanov <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> * Trying to make scales work with compileable attention * [Docs] Add Sky Computing Lab to project intro (vllm-project#12019) Signed-off-by: Woosuk Kwon <[email protected]> * [HPU][Bugfix] set_forward_context and CI test execution (vllm-project#12014) Signed-off-by: Konrad Zawora <[email protected]> * [Doc] Update Quantization Hardware Support Documentation (vllm-project#12025) Signed-off-by: tjtanaa <[email protected]> Co-authored-by: tjtanaa <[email protected]> * [HPU][misc] add comments for explanation (vllm-project#12034) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix various bugs in multi-modal processor (vllm-project#12031) Signed-off-by: DarkLight1337 <[email protected]> * [Kernel] Revert the API change of Attention.forward (vllm-project#12038) Signed-off-by: Chen Zhang <[email protected]> * [Platform] Add output for Attention Backend (vllm-project#11981) Signed-off-by: wangxiyuan <[email protected]> * [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (vllm-project#12040) Signed-off-by: Chen Zhang <[email protected]> * Explain where the engine args go when using Docker (vllm-project#12041) Signed-off-by: Harry Mellor <[email protected]> * Docs lint * [Doc]: Update the Json Example of the `Engine Arguments` document (vllm-project#12045) * [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (vllm-project#11924) Signed-off-by: Jee Jee Li <[email protected]> * [Kernel] Support MulAndSilu (vllm-project#11624) Signed-off-by: Jee Jee Li <[email protected]> * [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (vllm-project#12046) Signed-off-by: Konrad Zawora <[email protected]> * [Platform] move current_memory_usage() into platform (vllm-project#11369) Signed-off-by: Shanshan Shen <[email protected]> * [V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065) Signed-off-by: Woosuk Kwon <[email protected]> * [Misc] Add multipstep chunked-prefill support for FlashInfer (vllm-project#10467) * [core] Turn off GPU communication overlap for Ray executor (vllm-project#12051) Signed-off-by: Rui Qiao <[email protected]> * [core] platform agnostic executor via collective_rpc (vllm-project#11256) Signed-off-by: youkaichao <[email protected]> * [Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-project#12062) Signed-off-by: Kyle Sayers <[email protected]> * [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (vllm-project#12003) * Fix: cases with empty sparsity config (vllm-project#12057) Signed-off-by: Rahul Tuli <[email protected]> * Type-fix: make execute_model output type optional (vllm-project#12020) * [Platform] Do not raise error if _Backend is not found (vllm-project#12023) Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> * [Model]: Support internlm3 (vllm-project#12037) * Misc: allow to use proxy in `HTTPConnection` (vllm-project#12042) Signed-off-by: Yuan Zhou <[email protected]> * [Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765) Signed-off-by: kewang-xlnx <[email protected]> Signed-off-by: kewang2 <[email protected]> Co-authored-by: kewang2 <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [Doc]: Update `OpenAI-Compatible Server` documents (vllm-project#12082) * [Bugfix] use right truncation for non-generative tasks (vllm-project#12050) Signed-off-by: Joe Runde <[email protected]> * [V1][Core] Autotune encoder cache budget (vllm-project#11895) Signed-off-by: Roger Wang <[email protected]> * [Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> * Allow hip sources to be directly included when compiling for rocm. (vllm-project#12087) * [Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651) Signed-off-by: mgoin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: mgoin <[email protected]> * [Doc] Add documentation for specifying model architecture (vllm-project#12105) * Various cosmetic/comment fixes (vllm-project#12089) Signed-off-by: mgoin <[email protected]> * [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (vllm-project#12067) Signed-off-by: Isotr0py <[email protected]> * Support torchrun and SPMD-style offline inference (vllm-project#12071) Signed-off-by: youkaichao <[email protected]> * [core] LLM.collective_rpc interface and RLHF example (vllm-project#12084) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix max image feature size for Llava-one-vision (vllm-project#12104) Signed-off-by: Roger Wang <[email protected]> * Enable user marker for vllm profiling (#357) * Enable user marker for vllm profiling --------- Co-authored-by: Gregory Shtrasberg <[email protected]> * [misc] Add LoRA kernel micro benchmarks (vllm-project#11579) * [Model] Add support for deepseek-vl2-tiny model (vllm-project#12068) Signed-off-by: Isotr0py <[email protected]> * Deepseek V3 support (#364) * Changing the hard coded datatype to see if it's enough for the model to work * Picking the upstrteam moe kernel version * make upstream fix for v3 also works for rocm v2 * Conditional fnuz dtype * Requantizing from fn to fnuz * Requantizing moe as well * Actually requantizing moe weights * Conditional requantization and assert on padding in block quant * Format --------- Co-authored-by: charlifu <[email protected]> * [Bugfix] Set enforce_eager automatically for mllama (vllm-project#12127) Signed-off-by: Chen Zhang <[email protected]> * [Bugfix] Fix a path bug in disaggregated prefill example script. (vllm-project#12121) Signed-off-by: Kuntai Du <[email protected]> * [CI]add genai-perf benchmark in nightly benchmark (vllm-project#10704) Signed-off-by: Kunshang Ji <[email protected]> * [Doc] Add instructions on using Podman when SELinux is active (vllm-project#12136) Signed-off-by: Yuan Tang <[email protected]> * [Bugfix] Fix issues in CPU build Dockerfile (vllm-project#12135) Signed-off-by: Yuan Tang <[email protected]> * [BugFix] add more `is not None` check in VllmConfig.__post_init__ (vllm-project#12138) Signed-off-by: Chen Zhang <[email protected]> * [Misc] Add deepseek_vl2 chat template (vllm-project#12143) Signed-off-by: Isotr0py <[email protected]> * [ROCm][MoE] moe tuning support for rocm (vllm-project#12049) Signed-off-by: Divakar Verma <[email protected]> * [V1] Move more control of kv cache initialization from model_executor to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]> * [Misc][LoRA] Improve the readability of LoRA error messages (vllm-project#12102) Signed-off-by: Jee Jee Li <[email protected]> * [CI/Build][CPU][Bugfix] Fix CPU CI (vllm-project#12150) Signed-off-by: jiang1.li <[email protected]> * [core] allow callable in collective_rpc (vllm-project#12151) Signed-off-by: youkaichao <[email protected]> * [Bugfix] Fix score api for missing max_model_len validation (vllm-project#12119) Signed-off-by: Wallas Santos <[email protected]> * [Bugfix] Mistral tokenizer encode accept list of str (vllm-project#12149) Signed-off-by: Kunshang Ji <[email protected]> * [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (vllm-project#12134) Signed-off-by: Gregory Shtrasberg <[email protected]> * [torch.compile] disable logging when cache is disabled (vllm-project#12043) Signed-off-by: youkaichao <[email protected]> * [misc] fix cross-node TP (vllm-project#12166) Signed-off-by: youkaichao <[email protected]> * [AMD][CI/Build][Bugfix] use pytorch stale wheel (vllm-project#12172) Signed-off-by: hongxyan <[email protected]> * [core] further polish memory profiling (vllm-project#12126) Signed-off-by: youkaichao <[email protected]> * [Docs] Fix broken link in SECURITY.md (vllm-project#12175) Signed-off-by: Russell Bryant <[email protected]> * [Model] Port deepseek-vl2 processor, remove dependency (vllm-project#12169) Signed-off-by: Isotr0py <[email protected]> * [core] clean up executor class hierarchy between v1 and v0 (vllm-project#12171) Signed-off-by: youkaichao <[email protected]> * [Misc] Support register quantization method out-of-tree (vllm-project#11969) * [V1] Collect env var for usage stats (vllm-project#12115) * [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (vllm-project#12152) Signed-off-by: Michal Adamczyk <[email protected]> * [Bugfix] Fix multi-modal processors for transformers 4.48 (vllm-project#12187) * [torch.compile] store inductor compiled Python file (vllm-project#12182) Signed-off-by: youkaichao <[email protected]> * benchmark_serving support --served-model-name param (vllm-project#12109) Signed-off-by: zibai <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Misc] Add BNB support to GLM4-V model (vllm-project#12184) Signed-off-by: Isotr0py <[email protected]> * [V1] Add V1 support of Qwen2-VL (vllm-project#12128) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: imkero <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> * [Model] Support for fairseq2 Llama (vllm-project#11442) Signed-off-by: Martin Gleize <[email protected]> Co-authored-by: mgleize user <[email protected]> * [Bugfix] Fix num_heads value for simple connector when tp enabled (vllm-project#12074) Signed-off-by: Shangming Cai <[email protected]> * [torch.compile] fix sym_tensor_indices (vllm-project#12191) Signed-off-by: youkaichao <[email protected]> * Move linting to `pre-commit` (vllm-project#11975) Signed-off-by: Harry Mellor <[email protected]> * [DOC] Fix typo in docstring and assert message (vllm-project#12194) Signed-off-by: Yuan Tang <[email protected]> * [DOC] Add missing docstring in LLMEngine.add_request() (vllm-project#12195) Signed-off-by: Yuan Tang <[email protected]> * [Bugfix] Fix incorrect types in LayerwiseProfileResults (vllm-project#12196) Signed-off-by: Yuan Tang <[email protected]> * [Model] Add Qwen2 PRM model support (vllm-project#12202) Signed-off-by: Isotr0py <[email protected]> * [Core] Interface for accessing model from `VllmRunner` (vllm-project#10353) Signed-off-by: DarkLight1337 <[email protected]> * [misc] add placeholder format.sh (vllm-project#12206) Signed-off-by: youkaichao <[email protected]> * [CI/Build] Remove dummy CI steps (vllm-project#12208) Signed-off-by: DarkLight1337 <[email protected]> * [CI/Build] Make pre-commit faster (vllm-project#12212) Signed-off-by: DarkLight1337 <[email protected]> * [Model] Upgrade Aria to transformers 4.48 (vllm-project#12203) Signed-off-by: DarkLight1337 <[email protected]> * [misc] print a message to suggest how to bypass commit hooks (vllm-project#12217) Signed-off-by: youkaichao <[email protected]> * [core][bugfix] configure env var during import vllm (vllm-project#12209) Signed-off-by: youkaichao <[email protected]> * [V1] Remove `_get_cache_block_size` (vllm-project#12214) Signed-off-by: Chen Zhang <[email protected]> * [Misc] Pass `attention` to impl backend (vllm-project#12218) Signed-off-by: wangxiyuan <[email protected]> * [Bugfix] Fix `HfExampleModels.find_hf_info` (vllm-project#12223) Signed-off-by: DarkLight1337 <[email protected]> * [CI] Pass local python version explicitly to pre-commit mypy.sh (vllm-project#12224) Signed-off-by: Chen Zhang <[email protected]> * Using ROCm6.3.1 base docker and building hipblas-common (#366) * [Misc] Update CODEOWNERS (vllm-project#12229) * fix: update platform detection for M-series arm based MacBook processors (vllm-project#12227) Signed-off-by: isikhi <[email protected]> * [misc] add cuda runtime version to usage data (vllm-project#12190) Signed-off-by: youkaichao <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [bugfix] catch xgrammar unsupported array constraints (vllm-project#12210) Signed-off-by: Jason Cheng <[email protected]> * [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (vllm-project#12222) Signed-off-by: Jinzhen Lin <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> * Add quantization and guided decoding CODEOWNERS (vllm-project#12228) Signed-off-by: mgoin <[email protected]> * [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (vllm-project#11777) Signed-off-by: Gregory Shtrasberg <[email protected]> * [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (vllm-project#12230) Signed-off-by: NickLucche <[email protected]> * [ci/build] disable failed and flaky tests (vllm-project#12240) Signed-off-by: youkaichao <[email protected]> * [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (vllm-project#12244) Signed-off-by: DarkLight1337 <[email protected]> * [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (vllm-project#12237) Signed-off-by: Jee Jee Li <[email protected]> * [Misc] Remove redundant TypeVar from base model (vllm-project#12248) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-project#12252) Signed-off-by: DarkLight1337 <[email protected]> * [torch.compile] transparent compilation with more logging (vllm-project#12246) Signed-off-by: youkaichao <[email protected]> * [V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm-project#12259) Signed-off-by: Roger Wang <[email protected]> * Remove pytorch comments for outlines + compressed-tensors (vllm-project#12260) Signed-off-by: Thomas Parnell <[email protected]> * [Platform] improve platforms getattr (vllm-project#12264) Signed-off-by: Mengqing Cao <[email protected]> * [ci/build] update nightly torch for gh200 test (vllm-project#12270) Signed-off-by: youkaichao <[email protected]> * [Bugfix] fix race condition that leads to wrong order of token returned (vllm-project#10802) Signed-off-by: Jannis Schönleber <[email protected]> * [Kernel] fix moe_align_block_size error condition (vllm-project#12239) Signed-off-by: Jinzhen Lin <[email protected]> * [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (vllm-project#10907) Signed-off-by: rickyx <[email protected]> * [Bugfix] Multi-sequence broken (vllm-project#11898) Signed-off-by: Andy Lo <[email protected]> * [Misc] Remove experimental dep from tracing.py (vllm-project#12007) Signed-off-by: Adrian Cole <[email protected]> * [Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-project#12235) Signed-off-by: wangxiyuan <[email protected]> * [Core] Free CPU pinned memory on environment cleanup (vllm-project#10477) * Update pre-commit.yml (#374) * Update pre-commit.yml * Reapplying missing format * New codespell exclude location --------- Co-authored-by: Kevin H. Luu <[email protected]> * [bugfix] moe tuning. rm is_navi() (vllm-project#12273) Signed-off-by: Divakar Verma <[email protected]> * [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (vllm-project#12277) Signed-off-by: maleksan85 <[email protected]> Co-authored-by: maleksan85 <[email protected]> * [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (vllm-project#12281) Signed-off-by: Hongxia Yang <[email protected]> * [VLM] Simplify post-processing of replacement info (vllm-project#12269) Signed-off-by: DarkLight1337 <[email protected]> * [ci/lint] Add back default arg for pre-commit (vllm-project#12279) Signed-off-by: kevin <[email protected]> * [CI] add docker volume prune to neuron CI (vllm-project#12291) Signed-off-by: Liangfu Chen <[email protected]> * [Ci/Build] Fix mypy errors on main (vllm-project#12296) Signed-off-by: DarkLight1337 <[email protected]> * [Benchmark] More accurate TPOT calc in `benchmark_serving.py` (vllm-project#12288) Signed-off-by: Nick Hill <[email protected]> * [core] separate builder init and builder prepare for each batch (vllm-project#12253) Signed-off-by: youkaichao <[email protected]> * [Build] update requirements of no-device (vllm-project#12299) Signed-off-by: Mengqing Cao <[email protected]> * [Core] Support fully transparent sleep mode (vllm-project#11743) Signed-off-by: youkaichao <[email protected]> * [VLM] Avoid unnecessary tokenization (vllm-project#12310) Signed-off-by: DarkLight1337 <[email protected]> * [Model][Bugfix]: correct Aria model output (vllm-project#12309) Signed-off-by: xffxff <[email protected]> * [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (vllm-project#12313) Signed-off-by: Roger Wang <[email protected]> * [Doc] Add docs for prompt replacement (vllm-project#12318) Signed-off-by: DarkLight1337 <[email protected]> * [Misc] Fix the error in the tip for the --lora-modules parameter (vllm-project#12319) Signed-off-by: wangerxiao <[email protected]> * [Misc] Improve the readability of BNB error messages (vllm-project#12320) Signed-off-by: Jee Jee Li <[email protected]> * Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-init (#367) * switching detokenize flag to be False * detokenize = False for benchmarks * restoring default in main vllm code for detokenize * removing extra spaces * moving detokenize to flag * adding support for token ids --------- Co-authored-by: maleksan85 <[email protected]> * [Bugfix] Fix HPU multiprocessing executor (vllm-project#12167) Signed-off-by: Konrad Zawora <[email protected]> * [Core] Support `reset_prefix_cache` (vllm-project#12284) * [Frontend][V1] Online serving performance improvements (vllm-project#12287) * [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (vllm-project#12282) Signed-off-by: Randall Smith <[email protected]> * FP8 FA fixes (#381) * FP8 FA fixes Summary: Add missing clamp and fix reciprocal scale computation. * linter * Returning the use of the proper stream in allreduce (#382) * [Bugfix] Fixing AMD LoRA CI test. (vllm-project#12329) Signed-off-by: Alexei V. Ivanov <[email protected]> * [Docs] Update FP8 KV Cache documentation (vllm-project#12238) Signed-off-by: mgoin <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> * [Docs] Document vulnerability disclosure process (vllm-project#12326) Signed-off-by: Russell Bryant <[email protected]> * [V1] Add `uncache_blocks` (vllm-project#12333) * [doc] explain common errors around torch.compile (vllm-project#12340) Signed-off-by: youkaichao <[email protected]> * [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (vllm-project#12338) Signed-off-by: zhenwei <[email protected]> * [Bugfix] Fix k_proj's bias for whisper self attention (vllm-project#12342) Signed-off-by: Isotr0py <[email protected]> * [Kernel] Flash Attention 3 Support (vllm-project#12093) Signed-off-by: Lucas Wilkinson <[email protected]> * [Doc] Troubleshooting errors during model inspection (vllm-project#12351) Signed-off-by: DarkLight1337 <[email protected]> * [V1] Simplify M-RoPE (vllm-project#12352) Signed-off-by: Roger Wang <[email protected]> Co-authored-by: imkero <[email protected]> * [Bugfix] Fix broken internvl2 inference with v1 (vllm-project#12360) Signed-off-by: Isotr0py <[email protected]> * [core] add wake_up doc and some sanity check (vllm-project#12361) Signed-off-by: youkaichao <[email protected]> * [torch.compile] decouple compile sizes and cudagraph sizes (vllm-project#12243) Signed-off-by: youkaichao <[email protected]> * [FP8][Kernel] Dynamic kv cache scaling factors computation (vllm-project#11906) Signed-off-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Micah Williamson <[email protected]> * [TPU] Update TPU CI to use torchxla nightly on 20250122 (vllm-project#12334) Signed-off-by: Siyuan Liu <[email protected]> * [Docs] Document Phi-4 support (vllm-project#12362) Signed-off-by: Isotr0py <[email protected]> * [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order (vllm-project#11528) Signed-off-by: ElizaWszola <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: Michael Goin <[email protected]> * [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (vllm-project#12357) Signed-off-by: Junichi Sato <[email protected]> * [Docs] Add meetup slides (vllm-project#12345) Signed-off-by: Woosuk Kwon <[email protected]> * Using pytorch commit past the point when rowwise PR (pytorch/pytorch#144432) was merged (#384) * [Docs] Update spec decode + structured output in compat matrix (vllm-project#12373) Signed-off-by: Russell Bryant <[email protected]> * [V1][Frontend] Coalesce bunched `RequestOutput`s (vllm-project#12298) Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Robert Shaw <[email protected]> * Set weights_only=True when using torch.load() (vllm-project#12366) Signed-off-by: Russell Bryant <[email protected]> * [Bugfix] Path join when building local path for S3 clone (vllm-project#12353) Signed-off-by: Omer Dayan (SW-GPU) <[email protected]> * Update compressed-tensors version (vllm-project#12367) * [V1] Increase default batch size for H100/H200 (vllm-project#12369) Signed-off-by: Woosuk Kwon <[email protected]> * [perf] fix perf regression from vllm-project#12253 (vllm-project#12380) Signed-off-by: youkaichao <[email protected]> * [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389) Signed-off-by: Roger Wang <[email protected]> * [ci/build] fix wheel size check (vllm-project#12396) Signed-off-by: youkaichao <[email protected]> * [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382) * [ci/build] sync default value for wheel size (vllm-project#12398) Signed-off-by: youkaichao <[email protected]> * [Misc] Enable proxy support in benchmark script (vllm-project#12356) Signed-off-by: Junichi Sato <[email protected]> * [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375) Signed-off-by: Lucas Wilkinson <[email protected]> * Applying scales rename to fp8 config (#387) * [Misc] Remove deprecated code (vllm-project#12383) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405) Signed-off-by: Lucas Wilkinson <[email protected]> * Dev-docker Documentation Updates (#378) * Dev-docker Documentation Updates Minor updates to several sections, with links to other documents where appropriate. * Fix formatting of GEMM filename * README cleanup - Reorder some sections of the README to make them easier to follow - Improve formatting of bash commands - Prefer use of huggingface model names instead of hard-coded directories - Clean up wording * Expanded sample commands for Latency and Throughput * Fix markdown links * Fix pre-commit errors * Updates from review Initial updates to incorporate feedback from a review session held with @t-parry * Update script args to match current recommendations * Remove recommended max-num-seqs values for now --------- Co-authored-by: Gregory Shtrasberg <[email protected]> * [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413) * [Bugfix] Fix BLIP-2 processing (vllm-project#12412) Signed-off-by: DarkLight1337 <[email protected]> * [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408) Signed-off-by: Divakar Verma <[email protected]> * [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355) Signed-off-by: Isotr0py <[email protected]> * [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422) Signed-off-by: Siyuan Liu <[email protected]> * [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435) Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094) Signed-off-by: Keyun Tong <[email protected]> * [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439) Signed-off-by: Roger Wang <[email protected]> * [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445) * [Frontend] generation_config.json for maximum tokens(vllm-project#12242) Signed-off-by: Matthew Hendrey <[email protected]> Signed-off-by: Shangming Cai <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Co-authored-by: shangmingc <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: wangxiyuan <[email protected]> * [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417) Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: mgoin <[email protected]> * [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450) * [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434) Signed-off-by: Lucas Wilkinson <[email protected]> * [Build/CI] Fix libcuda.so linkage (vllm-project#12424) Signed-off-by: Tyler Michael Smith <[email protected]> * [Frontend] Rerank API (Jina- and Cohere-compatible API) (vllm-project#12376) Signed-off-by: Kyle Mistele <[email protected]> * [DOC] Add link to vLLM blog (vllm-project#12460) Signed-off-by: Yuan Tang <[email protected]> * [V1] Avoid list creation in input preparation (vllm-project#12457) Signed-off-by: Woosuk Kwon <[email protected]> * [Frontend] Support scores endpoint in run_batch (vllm-project#12430) Signed-off-by: Pooya Davoodi <[email protected]> * [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446) Signed-off-by: DarkLight1337 <[email protected]> * [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464) Signed-off-by: Isotr0py <[email protected]> * [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454) Signed-off-by: Woosuk Kwon <[email protected]> * [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467) Signed-off-by: Isotr0py <[email protected]> * [Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-project#12339) Signed-off-by: Lucas Wilkinson <[email protected]> * [V1][Metrics] Add initial Prometheus logger (vllm-project#12416) Signed-off-by: Mark McLoughlin <[email protected]> * [V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#12469) Signed-off-by: Woosuk Kwon <[email protected]> * [FlashInfer] Upgrade to 0.2.0 (vllm-project#11194) Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> * Support FP8 FA from Quark format (#388) * Support FP8 FA from Quark format * Support FP8 FA from Quark format * nit: update comment * Direct call on ROCm * 20250127 docs update (#392) * updating code blocks * typo * updated manifest * Including feedback * whitespace * Deepseek instructions * hyperlink fix * hyperlink fix * updating what is new * cpx update * typo * whitespace * whitespace * Faster Custom Paged Attention kernels (#372) * integrate new cpa kernel, update tests and benchmark * added comments to mfma4 kernel * further comments for mfma16 kernel * clang-format * Lint * add flag for logits rtz conversion and disable by default * lint * [Bugfix]: Fix paged attention unit tests of #372 (#389) * [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`,`csrc/attention/paged_attention_v2.cu` and `csrc/rocm/attention.cu`. * improve code documentation. * lint --------- Co-authored-by: vllmellm <[email protected]> --------- Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Joe Shajrawi <[email protected]> Co-authored-by: TJian <[email protected]> Co-authored-by: vllmellm <[email protected]> * Using a more precise profiling on ROCm to properly account for weights padding (#394) * Update Dockerfile.rocm * [Bugfix]: inclucde the env variables required for running FastSyncLLM Signed-off-by: vllmellm <[email protected]> * fix pre-commit lint Signed-off-by: vllmellm <[email protected]> * [Bugfix] included missing environment variable Signed-off-by: vllmellm <[email protected]> --------- Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Akshat Tripathi <[email protected]> Signed-off-by: Oleg Mosalov <[email protected]> Signed-off-by: Jee Jee Li <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Sungjae Lee <[email protected]> Signed-off-by: Yida Wu <[email protected]> Signed-off-by: Chenguang Li <[email protected]> Signed-off-by: youkaichao <[email protected]> Signed-off-by: Alex-Brooks <[email protected]> Signed-off-by: Chen Zhang <[email protected]> Signed-off-by: Roger Wang <[email protected]> Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Shanshan Shen <[email protected]> Signed-off-by: elijah <[email protected]> Signed-off-by: Yikun <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Konrad Zawora <[email protected]> Signed-off-by: tjtanaa <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: wangxiyuan <[email protected]> Signed-off-by: yisheng <[email protected]> Signed-off-by: Abatom <[email protected]> Signed-off-by: Liangfu Chen <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Yuan Zhou <[email protected]> Signed-off-by: Sourashis Roy <[email protected]> Signed-off-by: Nishidha Panpaliya <[email protected]> Signed-off-by: Ilya Lavrenov <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: Wallas Santos <[email protected]> Signed-off-by: jiang1.li <[email protected]> Signed-off-by: yan ma <[email protected]> Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]> Signed-off-by: Ye Qi <[email protected]> Signed-off-by: Mengqing Cao <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Kunshang Ji <[email protected]> Signed-off-by: Kuntai Du <[email protected]> Signed-off-by: Ren MinMin <[email protected]> Signed-off-by: Travis Johnson <[email protected]> Signed-off-by: Fred Reiss <[email protected]> Signed-off-by: shaochangxu.scx <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Rafael Vasquez <[email protected]> Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: Kyle Sayers <[email protected]> Signed-off-by: Rahul Tuli <[email protected]> Signed-off-by: kewang-xlnx <[email protected]> Signed-off-by: kewang2 <[email protected]> Signed-off-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Yuan Tang <[email protected]> Signed-off-by: Divakar Verma <[email protected]> Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: hongxyan <[email protected]> Signed-off-by: Michal Adamczyk <[email protected]> Signed-off-by: zibai <[email protected]> Signed-off-by: Martin Gleize <[email protected]> Signed-off-by: Shangming Cai <[email protected]> Signed-off-by: isikhi <[email protected]> Signed-off-by: Jason Cheng <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]> Signed-off-by: Thomas Parnell <[email protected]> Signed-off-by: Jannis Schönleber <[email protected]> Signed-off-by: rickyx <[email protected]> Signed-off-by: Andy Lo <[email protected]> Signed-off-by: Adrian Cole <[email protected]> Signed-off-by: maleksan85 <[email protected]> Signed-off-by: Hongxia Yang <[email protected]> Signed-off-by: kevin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Signed-off-by: xffxff <[email protected]> Signed-off-by: wangerxiao <[email protected]> Signed-off-by: Alexei V. Ivanov <[email protected]> Signed-off-by: zhenwei <[email protected]> Signed-off-by: Lucas Wilkinson <[email protected]> Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: ElizaWszola <[email protected]> Signed-off-by: Junichi Sato <[email protected]> Signed-off-by: Omer Dayan (SW-GPU) <[email protected]> Signed-off-by: Keyun Tong <[email protected]> Signed-off-by: Matthew Hendrey <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Kyle Mistele <[email protected]> Signed-off-by: Pooya Davoodi <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: vllmellm <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Akshat Tripathi <[email protected]> Co-authored-by: Oleg Mosalov <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Avshalom Manevich <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Yangcheng Li <[email protected]> Co-authored-by: Siyuan Li <[email protected]> Co-authored-by: Sungjae Lee <[email protected]> Co-authored-by: Concurrensee <[email protected]> Co-authored-by: Chenguang Li <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Alex Brooks <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Harry Mellor <[email protected]> Co-authored-by: Shanshan Shen <[email protected]> Co-authored-by: elijah <[email protected]> Co-authored-by: Yikun Jiang <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Steve Luo <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Alexei V. Ivanov <[email protected]> Co-authored-by: Gregory Shtrasberg <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Konrad Zawora <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Co-authored-by: maang-h <[email protected]> Co-authored-by: YiSheng5 <[email protected]> Co-authored-by: Zhonghua Deng <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: XiaobingZhang <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: Yuan <[email protected]> Co-authored-by: jiangjiadi <[email protected]> Co-authored-by: jiadi.jjd <[email protected]> Co-authored-by: sroy745 <[email protected]> Co-authored-by: Jie Fu (傅杰) <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: WangErXiao <[email protected]> Co-authored-by: Nishidha <[email protected]> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Wallas Henrique <[email protected]> Co-authored-by: Li, Jiang <[email protected]> Co-authored-by: Yan Ma <[email protected]> Co-authored-by: rasmith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Maxime Fournioux <[email protected]> Co-authored-by: Guspan Tanadi <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: yeq <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Co-authored-by: Charles Frye <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: cennn <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: minmin <[email protected]> Co-authored-by: Ren MinMin <[email protected]> Co-authored-by: Travis Johnson <[email protected]> Co-authored-by: Fred Reiss <[email protected]> Co-authored-by: shaochangxu <[email protected]> Co-authored-by: shaochangxu.scx <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: sixgod <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: Elfie Guo <[email protected]> Co-authored-by: Rui Qiao <[email protected]> Co-authored-by: Kyle Sayers <[email protected]> Co-authored-by: Rahul Tuli <[email protected]> Co-authored-by: Keyun Tong <[email protected]> Co-authored-by: RunningLeon <[email protected]> Co-authored-by: kewang-xlnx <[email protected]> Co-authored-by: kewang2 <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: tvirolai-amd <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Zhaoyi Li <[email protected]> Co-authored-by: charlifu <[email protected]> Co-authored-by: Yuan Tang <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: yancong <[email protected]> Co-authored-by: Michal Adamczyk <[email protected]> Co-authored-by: gujing <[email protected]> Co-authored-by: imkero <[email protected]> Co-authored-by: Martin Gleize <[email protected]> Co-authored-by: mgleize user <[email protected]> Co-authored-by: shangmingc <[email protected]> Co-authored-by: Işık <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cheng Kuan Yong Jason <[email protected]> Co-authored-by: Jinzhen Lin <[email protected]> Co-authored-by: Thomas Parnell <[email protected]> Co-authored-by: Jannis Schönleber <[email protected]> Co-authored-by: Ricky Xu <[email protected]> Co-authored-by: Andy Lo <[email protected]> Co-authored-by: Adrian Cole <[email protected]> Co-authored-by: Jani Monoses <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: maleksan85 <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: zhou fan <[email protected]> Co-authored-by: ilia-cher <[email protected]> Co-authored-by: liuzhenwei <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: Micah Williamson <[email protected]> Co-authored-by: Siyuan Liu <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: Junichi Sato <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: omer-dayan <[email protected]> Co-authored-by: Mohit Deopujari <[email protected]> Co-authored-by: Jeremy Arnold <[email protected]> Co-authored-by: Matthew Hendrey <[email protected]> Co-authored-by: Kyle Mistele <[email protected]> Co-authored-by: Pooya Davoodi <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Bowen Wang <[email protected]> Co-authored-by: Bowen Bao <[email protected]> Co-authored-by: arakowsk-amd <[email protected]> Co-authored-by: sanyalington <[email protected]> Co-authored-by: Joe Shajrawi <[email protected]> Co-authored-by: vllmellm <[email protected]>

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Isotr0py added 7 commits December 28, 2024 02:01

init deepseekvl2

b7f3a3b

Signed-off-by: Isotr0py <[email protected]>

port config

9846268

Signed-off-by: Isotr0py <[email protected]>

code format

0fdc10b

Signed-off-by: Isotr0py <[email protected]>

process image

19cf5e7

Signed-off-by: Isotr0py <[email protected]>

init processor

550ed2e

Signed-off-by: Isotr0py <[email protected]>

clean up

f2159c4

Signed-off-by: Isotr0py <[email protected]>

handle image embedding inputs

e20aba5

Signed-off-by: Isotr0py <[email protected]>

DarkLight1337 mentioned this pull request Dec 28, 2024

[RFC]: Multi-modality Support on vLLM #4194

Open

91 tasks

Isotr0py added 4 commits December 28, 2024 16:31

add multimodal processor

54c92fc

Signed-off-by: Isotr0py <[email protected]>

add max tokens implement

dd19a5d

Signed-off-by: Isotr0py <[email protected]>

implement embeddings merge

391ba13

Signed-off-by: Isotr0py <[email protected]>

add deepseek-vl2 example

bb88307

Signed-off-by: Isotr0py <[email protected]>

mergify bot added the frontend label Dec 28, 2024

Isotr0py added 5 commits December 28, 2024 20:52

register model

0ec661c

Signed-off-by: Isotr0py <[email protected]>

override example arch

847cb03

Signed-off-by: Isotr0py <[email protected]>

fix processor

bec7a43

Signed-off-by: Isotr0py <[email protected]>

Merge branch 'vllm-project:main' into deepseek-vl2

e417b98

fix config name

acc89f6

Signed-off-by: Isotr0py <[email protected]>

mergify bot added the needs-rebase label Dec 28, 2024

Isotr0py added 8 commits December 29, 2024 01:21

fix processor dtype

b9f2d4b

Signed-off-by: Isotr0py <[email protected]>

fix a typo

632c77c

Signed-off-by: Isotr0py <[email protected]>

fix vit

d97849d

Signed-off-by: Isotr0py <[email protected]>

fix a typo

6fb3845

Signed-off-by: Isotr0py <[email protected]>

add normal rope rotary

01a5316

Signed-off-by: Isotr0py <[email protected]>

code format

d5ebfcb

Signed-off-by: Isotr0py <[email protected]>

fix image token

d787200

Signed-off-by: Isotr0py <[email protected]>

update docs

d491ff0

Signed-off-by: Isotr0py <[email protected]>

mergify bot added the documentation Improvements or additions to documentation label Dec 30, 2024

Isotr0py mentioned this pull request Jan 15, 2025

[Model] Add support for deepseek-vl2-tiny model #12068

Merged

1 task

joennlae pushed a commit to 44ai-labs/vllm that referenced this pull request Jan 19, 2025

[Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

0df1050

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

joennlae pushed a commit to 44ai-labs/vllm that referenced this pull request Jan 19, 2025

[Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

b59fcaa

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

abmfy pushed a commit to abmfy/vllm-flashinfer that referenced this pull request Jan 24, 2025

[Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

d899804

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

rasmith pushed a commit to rasmith/vllm that referenced this pull request Jan 30, 2025

[Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

24e28e7

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Isotr0py mentioned this pull request Feb 4, 2025

[VLM] Add MLA with pure RoPE support for deepseek-vl2 models #12729

Merged

tlrmchlsmth mentioned this pull request Feb 5, 2025

v0.7.2 Release Tracker #12700

Closed

6 tasks

GWS0428 pushed a commit to GWS0428/VARserve that referenced this pull request Feb 12, 2025

[Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

a198d19

Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Model] Initialize support for Deepseek-VL2 models #11578

[Model] Initialize support for Deepseek-VL2 models #11578

Isotr0py commented Dec 28, 2024 •

edited by github-actions bot

Loading

github-actions bot commented Dec 28, 2024

mergify bot commented Dec 28, 2024

iamweiliu commented Jan 14, 2025

iamweiliu commented Jan 14, 2025

Isotr0py commented Jan 14, 2025 •

edited

Loading

iamweiliu commented Jan 14, 2025

heibaidaolx123 commented Feb 4, 2025

Isotr0py commented Feb 4, 2025

haonan-li commented Feb 4, 2025

heibaidaolx123 commented Feb 5, 2025

Isotr0py commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025

Isotr0py commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025 •

edited

Loading

Isotr0py commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025

[Model] Initialize support for Deepseek-VL2 models #11578

[Model] Initialize support for Deepseek-VL2 models #11578

Conversation

Isotr0py commented Dec 28, 2024 • edited by github-actions bot Loading

github-actions bot commented Dec 28, 2024

mergify bot commented Dec 28, 2024

iamweiliu commented Jan 14, 2025

iamweiliu commented Jan 14, 2025

Isotr0py commented Jan 14, 2025 • edited Loading

iamweiliu commented Jan 14, 2025

heibaidaolx123 commented Feb 4, 2025

Isotr0py commented Feb 4, 2025

haonan-li commented Feb 4, 2025

heibaidaolx123 commented Feb 5, 2025

Isotr0py commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025

Isotr0py commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025 • edited Loading

Isotr0py commented Feb 5, 2025

heibaidaolx123 commented Feb 5, 2025

Isotr0py commented Dec 28, 2024 •

edited by github-actions bot

Loading

Isotr0py commented Jan 14, 2025 •

edited

Loading

heibaidaolx123 commented Feb 5, 2025 •

edited

Loading