
[Model] Initialize support for Deepseek-VL2 models #11578

Merged
52 commits merged into vllm-project:main from deepseek-vl2 on Jan 12, 2025

Conversation

@Isotr0py (Collaborator) commented Dec 28, 2024

FIX #11236

  • Initialize support for deepseek-vl2 series models
  • Note that deepseek-ai/deepseek-vl2-tiny is not supported yet because it doesn't use MLA attention.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can do one of the following:

  • Add the ready label to the PR.
  • Enable auto-merge.

🚀

@mergify mergify bot added the frontend label Dec 28, 2024

mergify bot commented Dec 28, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 28, 2024
@mergify mergify bot added the documentation Improvements or additions to documentation label Dec 30, 2024
@csdY123 commented Jan 10, 2025

@Isotr0py
You're awesome! You've done an amazing job.
You just need to fix this one small bug and it will run successfully; I hit the following error:
[rank0]:   File "vllm/vllm/model_executor/models/deepseek_v3.py", line 601, in load_weights
[rank0]:     if self.config.num_nextn_predict_layers > 0:
[rank0]:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 205, in __getattribute__
[rank0]:     return super().__getattribute__(key)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'DeepseekV2Config' object has no attribute 'num_nextn_predict_layers'

@Isotr0py (Collaborator, Author) commented Jan 10, 2025

@csdY123 Added a check for the existence of num_nextn_predict_layers before accessing self.config.num_nextn_predict_layers, so the model should be able to load now.

(I don't have a device to test the full Deepseek-VL2 model right now, so your feedback is very valuable!) :)
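
For reference, a minimal sketch of that kind of guard (the helper name here is illustrative, not the exact code in the diff):

```python
from transformers import PretrainedConfig


def has_nextn_layers(config: PretrainedConfig) -> bool:
    # DeepseekV2Config does not define num_nextn_predict_layers, so read it
    # with getattr and a default of 0 instead of accessing it unconditionally.
    return getattr(config, "num_nextn_predict_layers", 0) > 0
```

With such a guard, loading a DeepseekV2-based checkpoint simply skips the next-n prediction branch instead of raising the AttributeError above.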

@Isotr0py (Collaborator, Author) commented:
The DeepSeek-V3 based deepseek-vl2 model should also work now.

Outputs
$ python examples/offline_inference/offline_inference_vision_language.py -m deepseek_vl_v2
INFO 01-10 15:08:08 __init__.py:179] Automatically detected platform cuda.
INFO 01-10 15:08:10 config.py:285] Overriding HF config with {'architectures': ['DeepseekVLV2ForCausalLM']}
INFO 01-10 15:08:17 config.py:516] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
INFO 01-10 15:08:17 config.py:1022] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 01-10 15:08:17 llm_engine.py:234] Initializing an LLM engine (v0.1.dev3959+g8d9b672) with config: model='deepseek-ai/deepseek-vl2', speculative_config=None, tokenizer='deepseek-ai/deepseek-vl2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-ai/deepseek-vl2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[2,1],"max_capture_size":2}, use_cached_outputs=False, 
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 01-10 15:08:19 cuda.py:176] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 01-10 15:08:19 cuda.py:178] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER
INFO 01-10 15:08:19 cuda.py:213] Using XFormers backend.
INFO 01-10 15:08:27 model_runner.py:1094] Starting to load model deepseek-ai/deepseek-vl2...
INFO 01-10 15:08:36 weight_utils.py:253] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:01<00:07,  1.13s/it]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:02<00:07,  1.32s/it]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:04<00:06,  1.37s/it]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:04<00:04,  1.07s/it]
Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:05<00:03,  1.13s/it]
Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:07<00:02,  1.25s/it]
Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:08<00:01,  1.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:09<00:00,  1.10s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:09<00:00,  1.18s/it]

INFO 01-10 15:08:46 model_runner.py:1099] Loading model weights took 51.2323 GB
WARNING 01-10 15:08:46 model_runner.py:1162] Using FP8 KV cache but no scaling factors provided. Defaulting to scaling factors of 1.0. This may lead to less accurate results!
Python version is above 3.10, patching the collections module.
Some kwargs in processor config are unused and will not have any effect: image_std, sft_format, downsample_ratio, normalize, candidate_resolutions, patch_size, image_token, add_special_token, ignore_id, image_mean, mask_prompt, pad_token. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Add grounding-related tokens = ['<|ref|>', '<|/ref|>', '<|det|>', '<|/det|>', '<|grounding|>'] to the tokenizer with input_ids
<|ref|>:128816
<|/ref|>:128817
<|det|>:128818
<|/det|>:128819
<|grounding|>:128820
Add chat tokens = ['<|User|>', '<|Assistant|>'] to the tokenizer with input_ids
<|User|>:128821
<|Assistant|>:128822

Some kwargs in processor config are unused and will not have any effect: image_std, sft_format, downsample_ratio, normalize, candidate_resolutions, patch_size, image_token, add_special_token, ignore_id, image_mean, mask_prompt, pad_token. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Add grounding-related tokens = ['<|ref|>', '<|/ref|>', '<|det|>', '<|/det|>', '<|grounding|>'] to the tokenizer with input_ids
<|ref|>:128816
<|/ref|>:128817
<|det|>:128818
<|/det|>:128819
<|grounding|>:128820
Add chat tokens = ['<|User|>', '<|Assistant|>'] to the tokenizer with input_ids
<|User|>:128821
<|Assistant|>:128822

You're using a CachedLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

INFO 01-10 15:08:58 worker.py:241] Memory profiling takes 12.19 seconds
INFO 01-10 15:08:58 worker.py:241] the current vLLM instance can use total_gpu_memory (79.15GiB) x gpu_memory_utilization (0.90) = 71.24GiB
INFO 01-10 15:08:58 worker.py:241] model weights take 51.23GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 1.18GiB; the rest of the memory reserved for KV Cache is 18.67GiB.
INFO 01-10 15:08:58 gpu_executor.py:76] # GPU blocks: 2549, # CPU blocks: 546
INFO 01-10 15:08:58 gpu_executor.py:80] Maximum concurrency for 4096 tokens per request: 9.96x
INFO 01-10 15:09:17 model_runner.py:1416] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.42s/it]
INFO 01-10 15:09:22 model_runner.py:1542] Graph capturing finished in 5 secs, took 0.24 GiB
INFO 01-10 15:09:22 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 36.41 seconds
Some kwargs in processor config are unused and will not have any effect: image_std, sft_format, downsample_ratio, normalize, candidate_resolutions, patch_size, image_token, add_special_token, ignore_id, image_mean, mask_prompt, pad_token. 
Add pad token = ['<|▁pad▁|>'] to the tokenizer
<|▁pad▁|>:2
Add image token = ['<image>'] to the tokenizer
<image>:128815
Add grounding-related tokens = ['<|ref|>', '<|/ref|>', '<|det|>', '<|/det|>', '<|grounding|>'] to the tokenizer with input_ids
<|ref|>:128816
<|/ref|>:128817
<|det|>:128818
<|/det|>:128819
<|grounding|>:128820
Add chat tokens = ['<|User|>', '<|Assistant|>'] to the tokenizer with input_ids
<|User|>:128821
<|Assistant|>:128822

You're using a CachedLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.78s/it, est. speed input: 802.93 toks/s, output: 25.69 toks/s]
The image shows a view of a tall tower, likely a communications or observation tower, surrounded by cherry blossom trees in full bloom. The sky is clear and blue, providing a beautiful backdrop to the scene.
The image features a tall tower, likely a communications or observation tower, surrounded by blooming cherry blossoms. The blossoms are in the foreground, with the tower rising into the background. The sky is clear and blue, providing a vibrant backdrop.
The image shows a view of a tall tower, likely a skyscraper or observation tower, with cherry blossoms in the foreground. The tower is surrounded by a clear blue sky, and the cherry blossoms are in full bloom, creating a beautiful and vibrant scene.
The image shows a view of a tall tower with a blue sky in the background. The foreground is filled with pink cherry blossoms, creating a beautiful contrast between the natural and man-made elements.


mergify bot commented Jan 10, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 10, 2025
@mergify mergify bot removed the needs-rebase label Jan 11, 2025
@DarkLight1337 (Member) left a comment

Otherwise LGTM. As per offline discussion, we can work on deepseek-ai/deepseek-vl2-tiny and the inner timm model in another PR.

Isotr0py and others added 3 commits January 12, 2025 00:04
@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 11, 2025
@simon-mo simon-mo merged commit f967e51 into vllm-project:main Jan 12, 2025
74 of 77 checks passed
@Isotr0py Isotr0py deleted the deepseek-vl2 branch January 12, 2025 17:58
hmellor pushed a commit to hmellor/vllm that referenced this pull request Jan 12, 2025
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
@Swipe4057 commented Jan 14, 2025

CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model /data/models/deepseek-vl2 --served-model-name deepseek-vl2 --gpu_memory_utilization 0.9 --quantization fp8 --max-model-len 4096 --disable-log-requests

Result:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/venvs/lib/vllm/vllm/engine/multiprocessing/engine.py", line 389, in run_mp_engine
    raise e
  File "/data/venvs/lib/vllm/vllm/engine/multiprocessing/engine.py", line 378, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/venvs/lib/vllm/vllm/engine/multiprocessing/engine.py", line 116, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/venvs/lib/vllm/vllm/engine/arg_utils.py", line 1043, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/venvs/lib/vllm/vllm/engine/arg_utils.py", line 969, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/data/venvs/lib/vllm/vllm/config.py", line 342, in __init__
    self.multimodal_config = self._init_multimodal_config(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/venvs/lib/vllm/vllm/config.py", line 398, in _init_multimodal_config
    if ModelRegistry.is_multimodal_model(architectures):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/venvs/lib/vllm/vllm/model_executor/models/registry.py", line 429, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/venvs/lib/vllm/vllm/model_executor/models/registry.py", line 384, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

@DarkLight1337 (Member) commented Jan 14, 2025

> CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model /data/models/deepseek-vl2 --served-model-name deepseek-vl2 --gpu_memory_utilization 0.9 --quantization fp8 --max-model-len 4096 --disable-log-requests
>
> Result:
>
> …
> TypeError: 'NoneType' object is not iterable

Can you show the full logs?

@Isotr0py (Collaborator, Author) commented:

File "/data/venvs/lib/vllm/vllm/model_executor/models/registry.py", line 384, in inspect_model_cls
for arch in architectures:
TypeError: 'NoneType' object is not iterable

The config.json files in the Deepseek-VL2 model repos are all missing the architectures field, so you need to specify --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' or add "architectures": ["DeepseekVLV2ForCausalLM"] to the config file manually.
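
The same override also works through the offline Python API; a minimal sketch (hf_overrides is merged into the HF config the same way the CLI flag is):

```python
from vllm import LLM

# Supply the architectures field that the upstream config.json omits.
llm = LLM(
    model="deepseek-ai/deepseek-vl2",
    hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]},
)
```

The "Overriding HF config with {'architectures': ['DeepseekVLV2ForCausalLM']}" line in the logs above comes from this override path.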

@iamweiliu commented:

> The config.json files in the Deepseek-VL2 model repos are all missing the architectures field, so you need to specify --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' or add "architectures": ["DeepseekVLV2ForCausalLM"] to the config file manually.

Saved my life!

@iamweiliu commented:
ERROR 01-14 18:14:14 engine.py:387] AttributeError: 'DeepseekVLV2Config' object has no attribute 'hidden_size'

@Isotr0py (Collaborator, Author) commented Jan 14, 2025

@iamweiliu Can you provide the full logs? hidden_size should not be read from DeepseekVLV2Config, because it doesn't have that field.
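
For anyone hitting this while debugging: DeepseekVLV2Config nests the text model's settings, so, assuming the nested attribute is named language_config as in the DeepSeek-VL2 repo, the hidden size would be read roughly like this:

```python
# Hypothetical sketch: hidden_size lives on the nested language config,
# not on the top-level DeepseekVLV2Config itself.
hidden_size = config.language_config.hidden_size
```

rather than via config.hidden_size.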

@iamweiliu commented:

> @iamweiliu Can you provide the full logs? hidden_size should not be read from DeepseekVLV2Config, because it doesn't have that field.

I already fixed it. Just install https://github.com/Isotr0py/DeepSeek-VL2.

gshtras added a commit to ROCm/vllm that referenced this pull request Jan 14, 2025
hongxiayang pushed a commit to ROCm/vllm that referenced this pull request Jan 15, 2025
* [Misc] Move weights mapper (vllm-project#11443)

Signed-off-by: Jee Jee Li <[email protected]>

* [Bugfix] Fix issues in CPU build Dockerfile. Fixes vllm-project#9182 (vllm-project#11435)

Signed-off-by: Yuan Tang <[email protected]>

* [Model] Automatic conversion of classification and reward models (vllm-project#11469)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (vllm-project#11472)

* [Misc] Update disaggregation benchmark scripts and test logs (vllm-project#11456)

Signed-off-by: Jiaxin Shan <[email protected]>

* [Frontend] Enable decord to load video from base64 (vllm-project#11492)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Improve GitHub links (vllm-project#11491)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Move some multimodal utils to modality-specific modules (vllm-project#11494)

Signed-off-by: DarkLight1337 <[email protected]>

* Mypy checking for vllm/compilation (vllm-project#11496)

Signed-off-by: lucast2021 <[email protected]>
Co-authored-by: lucast2021 <[email protected]>

* [Misc][LoRA] Fix LoRA weight mapper (vllm-project#11495)

Signed-off-by: Jee Jee Li <[email protected]>

* [Doc] Add `QVQ` and `QwQ` to the list of supported models (vllm-project#11509)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler (vllm-project#10681)

Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>

* [Model]  Modify MolmoForCausalLM MLP  (vllm-project#11510)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Add placeholder module (vllm-project#11501)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Add video example to openai client for multimodal (vllm-project#11521)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [1/N] API Server  (Remove Proxy) (vllm-project#11529)

* [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (vllm-project#11523)

Signed-off-by: mgoin <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: HandH1998 <[email protected]>

* [2/N] API Server: Avoid ulimit footgun (vllm-project#11530)

* Deepseek v3 (vllm-project#11502)

Signed-off-by: mgoin <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: robertgshaw2-neuralmagic <[email protected]>

* [Docs] Document Deepseek V3 support (vllm-project#11535)

Signed-off-by: simon-mo <[email protected]>

* Update openai_compatible_server.md (vllm-project#11536)

Co-authored-by: Simon Mo <[email protected]>

* [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (vllm-project#11394)

Signed-off-by: Woosuk Kwon <[email protected]>

* [V1] Fix yapf (vllm-project#11538)

Signed-off-by: Woosuk Kwon <[email protected]>

* [CI] Fix broken CI (vllm-project#11543)

* [misc] fix typing (vllm-project#11540)

Signed-off-by: youkaichao <[email protected]>

* [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (vllm-project#11534)

* [BugFix] Fix quantization for all other methods (vllm-project#11547)

* [Platform] Move model arch check to platform (vllm-project#11503)

Signed-off-by: Mengqing Cao <[email protected]>

* Update deploying_with_k8s.md with AMD ROCm GPU example (vllm-project#11465)

Signed-off-by: Alex He <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Bugfix] Fix TeleChat2ForCausalLM weights mapper (vllm-project#11546)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Abstract the logic for reading and writing media content (vllm-project#11527)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc]  Add xgrammar in doc (vllm-project#11549)

Signed-off-by: ccjincong <[email protected]>

* [VLM] Support caching in merged multi-modal processor (vllm-project#11396)

Signed-off-by: DarkLight1337 <[email protected]>

* [MODEL] LoRA support for Jamba model (vllm-project#11209)

Signed-off-by: Erez Schwartz <[email protected]>

* [Misc]Add BNB quantization for MolmoForCausalLM  (vllm-project#11551)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix (vllm-project#11566)

Signed-off-by: Isotr0py <[email protected]>

* [Bugfix] Fix for ROCM compressed tensor support (vllm-project#11561)

* [Doc] Update mllama example based on official doc (vllm-project#11567)

Signed-off-by: Chen Zhang <[email protected]>

* [V1] [4/N] API Server: ZMQ/MP Utilities (vllm-project#11541)

* [Bugfix] Last token measurement fix (vllm-project#11376)

Signed-off-by: rajveerb <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

* [Model] Support InternLM2 Reward models (vllm-project#11571)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Model] Remove hardcoded image tokens ids from Pixtral (vllm-project#11582)

Signed-off-by: Roger Wang <[email protected]>

* [Hardware][AMD]: Replace HIPCC version with more precise ROCm version (vllm-project#11515)

Signed-off-by: hjwei <[email protected]>

* [V1][Minor] Set pin_memory=False for token_ids_cpu tensor (vllm-project#11581)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Doc] Minor documentation fixes (vllm-project#11580)

Signed-off-by: DarkLight1337 <[email protected]>

* [bugfix] interleaving sliding window for cohere2 model (vllm-project#11583)

Signed-off-by: youkaichao <[email protected]>

* [V1] [5/N] API Server: unify `Detokenizer` and  `EngineCore` input (vllm-project#11545)

Signed-off-by: [email protected] <[email protected]>

* [Doc] Convert list tables to MyST (vllm-project#11594)

Signed-off-by: DarkLight1337 <[email protected]>

* [v1][bugfix] fix cudagraph with inplace buffer assignment (vllm-project#11596)

Signed-off-by: youkaichao <[email protected]>

* [Misc] KV cache transfer connector registry (vllm-project#11481)

Signed-off-by: KuntaiDu <[email protected]>

* Remove print statement in DeepseekScalingRotaryEmbedding (vllm-project#11604)

* [v1] fix compilation cache (vllm-project#11598)

Signed-off-by: youkaichao <[email protected]>

* [Docker] bump up neuron sdk v2.21 (vllm-project#11593)

Signed-off-by: Liangfu Chen <[email protected]>

* [Build][Kernel] Update CUTLASS to v3.6.0 (vllm-project#11607)

Signed-off-by: Tyler Michael Smith <[email protected]>

* [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (vllm-project#11618)

Signed-off-by: jiang1.li <[email protected]>

* [platforms] enable platform plugins (vllm-project#11602)

Signed-off-by: youkaichao <[email protected]>

* [VLM] Abstract out multi-modal data parsing in merged processor (vllm-project#11620)

Signed-off-by: DarkLight1337 <[email protected]>

* [V1] [6/N] API Server: Better Shutdown (vllm-project#11586)

* [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel (vllm-project#11631)

* [benchmark] Remove dependency for H100 benchmark step (vllm-project#11572)

* [Model][LoRA]LoRA support added for MolmoForCausalLM (vllm-project#11439)

Signed-off-by: Matthias Vogler <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Matthias Vogler <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Bugfix] Fix OpenAI parallel sampling when using xgrammar (vllm-project#11637)

Signed-off-by: mgoin <[email protected]>

* [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (vllm-project#6909)

Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (vllm-project#11565)

* [V1] Simpify vision block hash for prefix caching by removing offset from hash (vllm-project#11646)

* [V1][VLM] V1 support for selected single-image models. (vllm-project#11632)

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Benchmark] Add benchmark script for CPU offloading  (vllm-project#11533)

Signed-off-by: ApostaC <[email protected]>
Co-authored-by: KuntaiDu <[email protected]>

* [Bugfix][Refactor] Unify model management in frontend (vllm-project#11660)

Signed-off-by: Joe Runde <[email protected]>

* [VLM] Add max-count checking in data parser for single image models (vllm-project#11661)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

* [Misc] Optimize Qwen2-VL LoRA test (vllm-project#11663)

Signed-off-by: Jee Jee Li <[email protected]>

* [Misc] Replace space with - in the file names (vllm-project#11667)

Signed-off-by: Lu Fang <[email protected]>

* [Doc] Fix typo (vllm-project#11666)

Signed-off-by: Kazuhiro Serizawa <[email protected]>

* [V1] Implement Cascade Attention (vllm-project#11635)

Signed-off-by: Woosuk Kwon <[email protected]>

* [VLM] Move supported limits and max tokens to merged multi-modal processor (vllm-project#11669)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (vllm-project#11674)

Signed-off-by: DarkLight1337 <[email protected]>

* [mypy] Pass type checking in vllm/inputs (vllm-project#11680)

Signed-off-by: Tobias Pitters <[email protected]>

* [VLM] Merged multi-modal processor for LLaVA-NeXT (vllm-project#11682)

Signed-off-by: DarkLight1337 <[email protected]>

* According to vllm.EngineArgs, the name should be distributed_executor_backend (vllm-project#11689)

* [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. (vllm-project#10013)

Signed-off-by: Kathy Yu <[email protected]>

* [V1][Minor] Optimize token_ids_cpu copy (vllm-project#11692)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Bugfix] Change kv scaling factor by param json on nvidia gpu (vllm-project#11688)

Signed-off-by: bjmsong <[email protected]>
Co-authored-by: bjmsong <[email protected]>

* Resolve race conditions in Marlin kernel (vllm-project#11493)

Signed-off-by: wchen61 <[email protected]>

* [Misc] Minimum requirements for SageMaker compatibility (vllm-project#11576)

* Update default max_num_batch_tokens for chunked prefill (vllm-project#11694)

* [Bugfix] Check chain_speculative_sampling before calling it (vllm-project#11673)

Signed-off-by: Lu Fang <[email protected]>

* [perf-benchmark] Fix dependency for steps in benchmark pipeline (vllm-project#11710)

* [Model] Whisper model implementation (vllm-project#11280)

Co-authored-by: Aurick Qiao <[email protected]>

* [V1] Simplify Shutdown (vllm-project#11659)

* [Bugfix] Fix ColumnParallelLinearWithLoRA slice (vllm-project#11708)

Signed-off-by: ZincCat <[email protected]>

* [V1] Improve TP>1 Error Handling + Stack Trace (vllm-project#11721)

Co-authored-by: Tyler Michael Smith <[email protected]>

* [Misc]Add BNB quantization for Qwen2VL (vllm-project#11719)

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* Update requirements-tpu.txt to support python 3.9 and 3.11 (vllm-project#11695)

Signed-off-by: mgoin <[email protected]>

* [V1] Chore: cruft removal (vllm-project#11724)

* [V1] log GPU blocks num for MultiprocExecutor (vllm-project#11656)

* Update tool_calling.md (vllm-project#11701)

* Update bnb.md with example for OpenAI (vllm-project#11718)

* [V1] Add `RayExecutor` support for `AsyncLLM` (api server) (vllm-project#11712)

* [V1] Add kv cache utils tests. (vllm-project#11513)

Signed-off-by: xcnick <[email protected]>

* [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (vllm-project#11233)

Signed-off-by: Yan Burman <[email protected]>
Signed-off-by: Ido Asraff <[email protected]>

* [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision (vllm-project#11717)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix precision error in LLaVA-NeXT (vllm-project#11735)

Signed-off-by: DarkLight1337 <[email protected]>

* [Model] Remove unnecessary weight initialization logic (vllm-project#11736)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Bugfix][V1] Fix test_kv_cache_utils.py (vllm-project#11738)

Signed-off-by: Jee Jee Li <[email protected]>

* [MISC] Replace c10::optional with std::optional (vllm-project#11730)

Signed-off-by: Lu Fang <[email protected]>

* [distributed] remove pynccl's redundant stream (vllm-project#11744)

* fix: [doc] fix typo (vllm-project#11751)

Co-authored-by: Lancer <[email protected]>

* [Frontend] Improve `StreamingResponse` Exception Handling (vllm-project#11752)

* [distributed] remove pynccl's redundant change_state (vllm-project#11749)

* [Doc] [1/N] Reorganize Getting Started section (vllm-project#11645)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Remove block size constraint (vllm-project#11723)

* [V1] Add BlockTable class (vllm-project#11693)

Signed-off-by: Woosuk Kwon <[email protected]>

* [Misc] Fix typo for valid_tool_parses  (vllm-project#11753)

Signed-off-by: Rui Qiao <[email protected]>

* [V1] Refactor get_executor_cls (vllm-project#11754)

* [mypy] Forward pass function type hints in lora (vllm-project#11740)

Signed-off-by: lucast2021 <[email protected]>
Co-authored-by: lucast2021 <[email protected]>

* k8s-config: Update the secret to use stringData (vllm-project#11679)

Signed-off-by: Suraj Deshmukh <[email protected]>

* [VLM] Separate out profiling-related logic (vllm-project#11746)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc][2/N] Reorganize Models and Usage sections (vllm-project#11755)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix max image size for LLaVA-Onevision (vllm-project#11769)

Signed-off-by: Roger Wang <[email protected]>

* [doc] explain how to add interleaving sliding window support (vllm-project#11771)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix][V1] Fix molmo text-only inputs (vllm-project#11676)

Signed-off-by: Jee Jee Li <[email protected]>

* [Kernel] Move attn_type to Attention.__init__() (vllm-project#11690)

Signed-off-by: Chen Zhang <[email protected]>

* format

* [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (vllm-project#11685)

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* deepseek overflow fix (#349)

* [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (vllm-project#11772)

Signed-off-by: DarkLight1337 <[email protected]>

* [Model] Future-proof Qwen2-Audio multi-modal processor (vllm-project#11776)

Signed-off-by: DarkLight1337 <[email protected]>

* [XPU] Make pp group initilized for pipeline-parallelism (vllm-project#11648)

Signed-off-by: yisheng <[email protected]>

* [Doc][3/N] Reorganize Serving section (vllm-project#11766)

Signed-off-by: DarkLight1337 <[email protected]>

* [Kernel][LoRA]Punica prefill  kernels fusion (vllm-project#11234)

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Abatom <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>

* [Bugfix] Update attention interface in `Whisper` (vllm-project#11784)

Signed-off-by: Roger Wang <[email protected]>

* [CI] Fix neuron CI and run offline tests (vllm-project#11779)

Signed-off-by: Liangfu Chen <[email protected]>

* fix init error for MessageQueue when n_local_reader is zero (vllm-project#11768)

* [Doc] Create a vulnerability management team (vllm-project#9925)

Signed-off-by: Russell Bryant <[email protected]>

* [CI][CPU] adding build number to docker image name (vllm-project#11788)

Signed-off-by: Yuan Zhou <[email protected]>

* [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (vllm-project#11798)

Signed-off-by: Roger Wang <[email protected]>

* [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (vllm-project#11800)

Signed-off-by: DarkLight1337 <[email protected]>

* [doc] add doc to explain how to use uv (vllm-project#11773)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] Support audio language models on V1 (vllm-project#11733)

Signed-off-by: Roger Wang <[email protected]>

* [doc] update how pip can install nightly wheels (vllm-project#11806)

Signed-off-by: youkaichao <[email protected]>

* [Doc] Add note to `gte-Qwen2` models (vllm-project#11808)

Signed-off-by: DarkLight1337 <[email protected]>

* [optimization] remove python function call for custom op (vllm-project#11750)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] update the prefix for qwen2 (vllm-project#11795)

Co-authored-by: jiadi.jjd <[email protected]>

* [Doc]Add documentation for using EAGLE in vLLM (vllm-project#11417)

Signed-off-by: Sourashis Roy <[email protected]>

* [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (vllm-project#11794)

* [Doc] Group examples into categories (vllm-project#11782)

Signed-off-by: Harry Mellor <[email protected]>

* [Bugfix] Fix image input for Pixtral-HF (vllm-project#11741)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] sort torch profiler table by kernel timing (vllm-project#11813)

* Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (vllm-project#11824)

* Fixed docker build for ppc64le (vllm-project#11518)

Signed-off-by: Nishidha Panpaliya <[email protected]>

* [OpenVINO] Fixed Docker.openvino build (vllm-project#11732)

Signed-off-by: Ilya Lavrenov <[email protected]>

* [Bugfix] Add checks for LoRA and CPU offload (vllm-project#11810)

Signed-off-by: Jee Jee Li <[email protected]>

* [Docs] reorganize sponsorship page (vllm-project#11639)

Signed-off-by: simon-mo <[email protected]>

* [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used (vllm-project#11825)

Signed-off-by: DarkLight1337 <[email protected]>

* [misc] improve memory profiling (vllm-project#11809)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [doc] update wheels url (vllm-project#11830)

Signed-off-by: youkaichao <[email protected]>

* [Docs] Update sponsor name: 'Novita' to 'Novita AI' (vllm-project#11833)

* [Hardware][Apple] Native support for macOS Apple Silicon (vllm-project#11696)

Signed-off-by: Wallas Santos <[email protected]>
Co-authored-by: Michael Goin <[email protected]>

* [torch.compile] consider relevant code in compilation cache (vllm-project#11614)

Signed-off-by: youkaichao <[email protected]>

* [VLM] Reorganize profiling/processing-related code (vllm-project#11812)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Move examples into categories (vllm-project#11840)

Signed-off-by: Harry Mellor <[email protected]>

* [Doc][4/N] Reorganize API Reference (vllm-project#11843)

Signed-off-by: DarkLight1337 <[email protected]>

* [CI/Build][Bugfix] Fix CPU CI image clean up (vllm-project#11836)

Signed-off-by: jiang1.li <[email protected]>

* [Bugfix][XPU] fix silu_and_mul (vllm-project#11823)

Signed-off-by: yan ma <[email protected]>

* [Misc] Move some model utils into vision file (vllm-project#11848)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Expand Multimodal API Reference (vllm-project#11852)

Signed-off-by: DarkLight1337 <[email protected]>

* [Misc] Add some explanations for BlockHashType (vllm-project#11847)

* [TPU][Quantization] TPU `W8A8` (vllm-project#11785)

Co-authored-by: Woosuk Kwon <[email protected]>

* [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (vllm-project#11698)

Signed-off-by: Randall Smith <[email protected]>

* [Docs] Add Google Cloud Meetup (vllm-project#11864)

* Revert nccl changes (#351)

* Revert "[distributed] remove pynccl's redundant change_state (vllm-project#11749)"

This reverts commit 9e764e7.

* Revert "[distributed] remove pynccl's redundant stream (vllm-project#11744)"

This reverts commit 635b897.

* [CI] Turn on basic correctness tests for V1 (vllm-project#10864)

* treat do_lower_case in the same way as the sentence-transformers library (vllm-project#11815)

Signed-off-by: Max de Bayser <[email protected]>

* [Doc] Recommend uv and python 3.12 for quickstart guide (vllm-project#11849)

Signed-off-by: mgoin <[email protected]>

* [Misc] Move `print_*_once` from utils to logger (vllm-project#11298)

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>
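
As background for this move: a "print once" helper is commonly implemented by memoizing on the message, e.g. with `functools.lru_cache`. A minimal sketch under that assumption (not necessarily vLLM's exact code):

```python
import logging
from functools import lru_cache

logger = logging.getLogger("vllm")

@lru_cache(maxsize=None)
def print_warning_once(msg: str) -> None:
    # lru_cache keys on the message string, so each distinct
    # message is logged only the first time it is seen.
    logger.warning(msg)
```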

* [Doc] Fix intended links to the Python multiprocessing library (vllm-project#11878)

* [perf] fix current stream (vllm-project#11870)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix] Override dunder methods of placeholder modules (vllm-project#11882)

Signed-off-by: DarkLight1337 <[email protected]>
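
Background on why dunder methods need explicit overrides: Python looks up special methods on the type, not the instance, so a placeholder object's `__getattr__` alone cannot intercept calls like `len(obj)`. A minimal sketch with a hypothetical `PlaceholderModule` class (not vLLM's actual implementation):

```python
from typing import NoReturn

class PlaceholderModule:
    """Stands in for an uninstalled optional dependency."""

    def __init__(self, name: str) -> None:
        self._name = name

    def _raise(self) -> NoReturn:
        raise ModuleNotFoundError(
            f"Optional dependency {self._name!r} is not installed.")

    def __getattr__(self, attr: str):
        # Handles ordinary attribute access (placeholder.foo) ...
        self._raise()

    # ... but implicit dunder lookups such as len(obj) bypass
    # instance-level __getattr__, so they must be defined on the class.
    def __len__(self) -> int:
        self._raise()

    def __iter__(self):
        self._raise()
```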

* [Bugfix] fix beam search input errors and latency benchmark script (vllm-project#11875)

Signed-off-by: Ye Qi <[email protected]>
Co-authored-by: yeq <[email protected]>

* [Doc] Add model development API Reference (vllm-project#11884)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] Allow platform specify attention backend (vllm-project#11609)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>

* [ci] try to fix flaky multi-step tests (vllm-project#11894)

Signed-off-by: youkaichao <[email protected]>

* [Misc] Provide correct Pixtral-HF chat template (vllm-project#11891)

Signed-off-by: DarkLight1337 <[email protected]>

* fp8 support (#352)

Co-authored-by: Yida Wu <[email protected]>

* [Docs] Add Modal to deployment frameworks (vllm-project#11907)

* [Doc][5/N] Move Community and API Reference to the bottom (vllm-project#11896)

Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: Simon Mo <[email protected]>

* [VLM] Enable tokenized inputs for merged multi-modal processor (vllm-project#11900)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Show default pooling method in a table (vllm-project#11904)

Signed-off-by: DarkLight1337 <[email protected]>

* [torch.compile] Hide KV cache behind torch.compile boundary (vllm-project#11677)

Signed-off-by: Chen Zhang <[email protected]>

* [Bugfix] Validate lora adapters to avoid crashing server (vllm-project#11727)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>

* [BUGFIX] Fix `UnspecifiedPlatform` package name (vllm-project#11916)

Signed-off-by: Kunshang Ji <[email protected]>

* [ci] fix gh200 tests (vllm-project#11919)

Signed-off-by: youkaichao <[email protected]>

* [misc] remove python function call for custom activation op (vllm-project#11885)

Co-authored-by: youkaichao <[email protected]>

* [platform] support pytorch custom op pluggable (vllm-project#11328)

Signed-off-by: wangxiyuan <[email protected]>

* Replace "online inference" with "online serving" (vllm-project#11923)

Signed-off-by: Harry Mellor <[email protected]>

* [ci] Fix sampler tests (vllm-project#11922)

Signed-off-by: youkaichao <[email protected]>

* [Doc] [1/N] Initial guide for merged multi-modal processor (vllm-project#11925)

Signed-off-by: DarkLight1337 <[email protected]>

* [platform] support custom torch.compile backend key (vllm-project#11318)

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Doc] Rename offline inference examples (vllm-project#11927)

Signed-off-by: Harry Mellor <[email protected]>

* [Docs] Fix docstring in `get_ip` function (vllm-project#11932)

Signed-off-by: Kuntai Du <[email protected]>

* Doc fix in `benchmark_long_document_qa_throughput.py` (vllm-project#11933)

Signed-off-by: Kuntai Du <[email protected]>

* [Hardware][CPU] Support MOE models on x86 CPU (vllm-project#11831)

Signed-off-by: jiang1.li <[email protected]>

* [Misc] Clean up debug code in Deepseek-V3 (vllm-project#11930)

Signed-off-by: Isotr0py <[email protected]>

* [Misc] Update benchmark_prefix_caching.py to fix example usage (vllm-project#11920)

Signed-off-by: Ren MinMin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>

* [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (vllm-project#11939)

Signed-off-by: Travis Johnson <[email protected]>
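
For context, the fix enforces that the number of images supplied matches the number of `<|image|>` placeholders in the prompt. A minimal sketch of such a check, using a hypothetical `check_image_placeholders` helper rather than vLLM's actual code:

```python
def check_image_placeholders(prompt: str, num_images: int,
                             placeholder: str = "<|image|>") -> None:
    # Compare placeholder occurrences in the prompt against the
    # number of image inputs; a mismatch would fail later anyway,
    # but with a much less helpful error.
    num_tokens = prompt.count(placeholder)
    if num_tokens != num_images:
        raise ValueError(
            f"Prompt has {num_tokens} {placeholder!r} token(s) "
            f"but {num_images} image(s) were provided.")
```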

* [mypy] Fix mypy warnings in api_server.py (vllm-project#11941)

Signed-off-by: Fred Reiss <[email protected]>

* [ci] fix broken distributed-tests-4-gpus (vllm-project#11937)

Signed-off-by: youkaichao <[email protected]>

* [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (vllm-project#11672)

Signed-off-by: Sungjae Lee <[email protected]>

* [Bugfix] fused_experts_impl wrong compute type for float32 (vllm-project#11921)

Signed-off-by: shaochangxu.scx <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>

* [CI/Build] Move model-specific multi-modal processing tests (vllm-project#11934)

Signed-off-by: DarkLight1337 <[email protected]>

* [Doc] Basic guide for writing unit tests for new models (vllm-project#11951)

Signed-off-by: DarkLight1337 <[email protected]>

* [Bugfix] Fix RobertaModel loading (vllm-project#11940)

Signed-off-by: NickLucche <[email protected]>

* [Model] Add CogAgent model support to vLLM (vllm-project#11742)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [V1] Avoid sending text prompt to core engine (vllm-project#11963)

Signed-off-by: Roger Wang <[email protected]>

* [CI/Build] Add markdown linter (vllm-project#11857)

Signed-off-by: Rafael Vasquez <[email protected]>

* [Model] Initialize support for Deepseek-VL2 models (vllm-project#11578)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Hardware][CPU] Multi-LoRA implementation for the CPU backend (vllm-project#11100)

Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Hardware][TPU] workaround fix for MoE on TPU (vllm-project#11764)

* [V1][Core][1/n] Logging and Metrics (vllm-project#11962)

Signed-off-by: [email protected] <[email protected]>

* [Model] Support GGUF models newly added in `transformers` 4.46.0 (vllm-project#9685)

Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (vllm-project#11973)

Signed-off-by: [email protected] <[email protected]>

* [MISC] fix typo in kv transfer send recv test (vllm-project#11983)

* [Bug] Fix usage of `.transpose()` and `.view()` consecutively. (vllm-project#11979)
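
For context on this fix: PyTorch's `view()` requires a compatible contiguous layout, while `transpose()` returns a non-contiguous view, so chaining the two raises at runtime. A minimal illustration of the pitfall (not the actual patch):

```python
import torch

x = torch.randn(2, 3, 4)
t = x.transpose(0, 1)          # non-contiguous view with shape (3, 2, 4)

# t.view(3, 8) would raise a RuntimeError here because view() needs a
# contiguous memory layout. Safe alternatives:
y = t.reshape(3, 8)            # copies only when necessary
z = t.contiguous().view(3, 8)  # make contiguous first, then view
```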

* [CI][Spec Decode] fix: broken test for EAGLE model (vllm-project#11972)

Signed-off-by: Sungjae Lee <[email protected]>

* [Misc] Fix Deepseek V2 fp8 kv-scale remapping (vllm-project#11947)

Signed-off-by: Yida Wu <[email protected]>

* [Misc] Minor changes about Worker (vllm-project#11555)

Signed-off-by: Chenguang Li <[email protected]>

* [platform] add ray_device_key (vllm-project#11948)

Signed-off-by: youkaichao <[email protected]>

* Fix Max Token ID for Qwen-VL-Chat (vllm-project#11980)

Signed-off-by: Alex-Brooks <[email protected]>

* [Kernel] unified_attention for Attention.forward (vllm-project#11967)

Signed-off-by: Chen Zhang <[email protected]>

* [Doc][V1] Update model implementation guide for V1 support (vllm-project#11998)

Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>

* [Doc] Organise installation documentation into categories and tabs (vllm-project#11935)

Signed-off-by: Harry Mellor <[email protected]>

* [platform] add device_control env var (vllm-project#12009)

Signed-off-by: youkaichao <[email protected]>

* [Platform] Move get_punica_wrapper() function to Platform (vllm-project#11516)

Signed-off-by: Shanshan Shen <[email protected]>

* bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function (vllm-project#11982)

Signed-off-by: elijah <[email protected]>

* Using list

* Revert "[misc] improve memory profiling (vllm-project#11809)"

This reverts commit 889e662.

* Multi-lingual P3L (#356)

* Committing the *multilingual* P3L test.

* Created a *multi-lingual* P3L test.

* Making ruff happy.

* .

* Added a reference to the language-scripture Confluence table.

* Typo fixing.

* Harmonizing naming.

* Fixing comments in the header.

---------

Co-authored-by: Alexei V. Ivanov <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>

* Trying to make scales work with compilable attention

* Docs lint

* Linter formatting bug fixes

* Inherit config file updates under fused_moe from the main branch.

* Match tests for the MoE layers with main.

---------

Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Jiaxin Shan <[email protected]>
Signed-off-by: lucast2021 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: Alex He <[email protected]>
Signed-off-by: ccjincong <[email protected]>
Signed-off-by: Erez Schwartz <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: rajveerb <[email protected]>
Signed-off-by: hjwei <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: Liangfu Chen <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Matthias Vogler <[email protected]>
Signed-off-by: ApostaC <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Kazuhiro Serizawa <[email protected]>
Signed-off-by: Tobias Pitters <[email protected]>
Signed-off-by: Kathy Yu <[email protected]>
Signed-off-by: bjmsong <[email protected]>
Signed-off-by: wchen61 <[email protected]>
Signed-off-by: ZincCat <[email protected]>
Signed-off-by: xcnick <[email protected]>
Signed-off-by: Yan Burman <[email protected]>
Signed-off-by: Ido Asraff <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Suraj Deshmukh <[email protected]>
Signed-off-by: yisheng <[email protected]>
Signed-off-by: Abatom <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Yuan Zhou <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: Ilya Lavrenov <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: yan ma <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
Signed-off-by: Ye Qi <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Kuntai Du <[email protected]>
Signed-off-by: Ren MinMin <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Fred Reiss <[email protected]>
Signed-off-by: Sungjae Lee <[email protected]>
Signed-off-by: shaochangxu.scx <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Signed-off-by: Akshat Tripathi <[email protected]>
Signed-off-by: Oleg Mosalov <[email protected]>
Signed-off-by: Yida Wu <[email protected]>
Signed-off-by: Chenguang Li <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: elijah <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Lucas Tucker <[email protected]>
Co-authored-by: lucast2021 <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: simon-mo <[email protected]>
Co-authored-by: HandH1998 <[email protected]>
Co-authored-by: robertgshaw2-neuralmagic <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: AlexHe99 <[email protected]>
Co-authored-by: Chen1022 <[email protected]>
Co-authored-by: ErezSC42 <[email protected]>
Co-authored-by: Selali <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Rajveer Bachkaniwala <[email protected]>
Co-authored-by: hj-wei <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Liangfu Chen <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: whyiug <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Matthias Vogler <[email protected]>
Co-authored-by: Matthias Vogler <[email protected]>
Co-authored-by: John Giorgi <[email protected]>
Co-authored-by: sakunkun <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Yihua Cheng <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Kazuhiro Serizawa <[email protected]>
Co-authored-by: Tobias Pitters <[email protected]>
Co-authored-by: Chunyang Wen <[email protected]>
Co-authored-by: Kathy Yu <[email protected]>
Co-authored-by: bjmsong <[email protected]>
Co-authored-by: bjmsong <[email protected]>
Co-authored-by: wchen61 <[email protected]>
Co-authored-by: Nathan Azrak <[email protected]>
Co-authored-by: Sachin Varghese <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: ZincCat <[email protected]>
Co-authored-by: WangErXiao <[email protected]>
Co-authored-by: Hust_YangXian <[email protected]>
Co-authored-by: Alberto Ferrer <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: xcnick <[email protected]>
Co-authored-by: Yan Burman <[email protected]>
Co-authored-by: cennn <[email protected]>
Co-authored-by: Lancer <[email protected]>
Co-authored-by: Lancer <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Suraj Deshmukh <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Concurrensee <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: Zhonghua Deng <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: jiangjiadi <[email protected]>
Co-authored-by: jiadi.jjd <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Jie Fu (傅杰) <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Wallas Henrique <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Maxime Fournioux <[email protected]>
Co-authored-by: Guspan Tanadi <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: yeq <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: Yida Wu <[email protected]>
Co-authored-by: Charles Frye <[email protected]>
Co-authored-by: minmin <[email protected]>
Co-authored-by: Ren MinMin <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Fred Reiss <[email protected]>
Co-authored-by: Sungjae Lee <[email protected]>
Co-authored-by: shaochangxu <[email protected]>
Co-authored-by: shaochangxu.scx <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: sixgod <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: Akshat Tripathi <[email protected]>
Co-authored-by: Oleg Mosalov <[email protected]>
Co-authored-by: Avshalom Manevich <[email protected]>
Co-authored-by: Yangcheng Li <[email protected]>
Co-authored-by: Siyuan Li <[email protected]>
Co-authored-by: Chenguang Li <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Shanshan Shen <[email protected]>
Co-authored-by: elijah <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Alexei V. Ivanov <[email protected]>
Co-authored-by: vllmellm <[email protected]>