diff --git a/docs/source/getting_started/debugging.md b/docs/source/getting_started/debugging.md
index 3b0029f2e88ce..19eb699572a08 100644
--- a/docs/source/getting_started/debugging.md
+++ b/docs/source/getting_started/debugging.md
@@ -197,4 +197,4 @@ if __name__ == '__main__':
 ## Known Issues
 
 - In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
-- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
+- To circumvent an NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234), all vLLM processes set the environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. The allocator only provides memory savings, so disabling it does not affect performance. External processes that set up an NCCL connection with vLLM's processes should also set this environment variable; otherwise, the inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656).
diff --git a/docs/source/getting_started/gaudi-installation.md b/docs/source/getting_started/gaudi-installation.md
index 447bf98084a5d..1f2ee62860dec 100644
--- a/docs/source/getting_started/gaudi-installation.md
+++ b/docs/source/getting_started/gaudi-installation.md
@@ -141,26 +141,25 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
 
 Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag.
 
-```{eval-rst}
-.. list-table:: vLLM execution modes
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - ``PT_HPU_LAZY_MODE``
-     - ``enforce_eager``
-     - execution mode
-   * - 0
-     - 0
-     - torch.compile
-   * - 0
-     - 1
-     - PyTorch eager mode
-   * - 1
-     - 0
-     - HPU Graphs
-   * - 1
-     - 1
-     - PyTorch lazy mode
+```{list-table} vLLM execution modes
+:widths: 25 25 50
+:header-rows: 1
+
+* - `PT_HPU_LAZY_MODE`
+  - `enforce_eager`
+  - execution mode
+* - 0
+  - 0
+  - torch.compile
+* - 0
+  - 1
+  - PyTorch eager mode
+* - 1
+  - 0
+  - HPU Graphs
+* - 1
+  - 1
+  - PyTorch lazy mode
 ```
 
 ```{warning}
diff --git a/docs/source/getting_started/tpu-installation.md b/docs/source/getting_started/tpu-installation.md
index 17eded4a51fec..4d3ac541c90ce 100644
--- a/docs/source/getting_started/tpu-installation.md
+++ b/docs/source/getting_started/tpu-installation.md
@@ -68,33 +68,32 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
   --service-account SERVICE_ACCOUNT
 ```
 
-```{eval-rst}
-.. list-table:: Parameter descriptions
-   :header-rows: 1
-
-   * - Parameter name
-     - Description
-   * - QUEUED_RESOURCE_ID
-     - The user-assigned ID of the queued resource request.
-   * - TPU_NAME
-     - The user-assigned name of the TPU which is created when the queued
-       resource request is allocated.
-   * - PROJECT_ID
-     - Your Google Cloud project
-   * - ZONE
-     - The GCP zone where you want to create your Cloud TPU. The value you use
-       depends on the version of TPUs you are using. For more information, see
-       `TPU regions and zones `_
-   * - ACCELERATOR_TYPE
-     - The TPU version you want to use. Specify the TPU version, for example
-       `v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
-       see `TPU versions `_.
-   * - RUNTIME_VERSION
-     - The TPU VM runtime version to use. For more information see `TPU VM images `_.
-   * - SERVICE_ACCOUNT
-     - The email address for your service account. You can find it in the IAM
-       Cloud Console under *Service Accounts*. For example:
-       `tpu-service-account@.iam.gserviceaccount.com`
+```{list-table} Parameter descriptions
+:header-rows: 1
+
+* - Parameter name
+  - Description
+* - QUEUED_RESOURCE_ID
+  - The user-assigned ID of the queued resource request.
+* - TPU_NAME
+  - The user-assigned name of the TPU which is created when the queued
+    resource request is allocated.
+* - PROJECT_ID
+  - Your Google Cloud project
+* - ZONE
+  - The GCP zone where you want to create your Cloud TPU. The value you use
+    depends on the version of TPUs you are using. For more information, see
+    *TPU regions and zones* in the Cloud TPU documentation.
+* - ACCELERATOR_TYPE
+  - The TPU version you want to use. Specify the TPU version, for example
+    `v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
+    see *TPU versions* in the Cloud TPU documentation.
+* - RUNTIME_VERSION
+  - The TPU VM runtime version to use. For more information, see *TPU VM images*
+    in the Cloud TPU documentation.
+* - SERVICE_ACCOUNT
+  - The email address for your service account. You can find it in the IAM
+    Cloud Console under *Service Accounts*. For example:
+    `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
 ```
 
 Connect to your TPU using SSH:
 
diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md
index e11befbb8dd30..518505abeb2a9 100644
--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
@@ -72,291 +72,290 @@ See [this page](#generative-models) for more information on how to use generativ
 
 #### Text Generation (`--task generate`)
 
-```{eval-rst}
-.. list-table::
-   :widths: 25 25 50 5 5
-   :header-rows: 1
-
-   * - Architecture
-     - Models
-     - Example HF Models
-     - :ref:`LoRA `
-     - :ref:`PP `
-   * - :code:`AquilaForCausalLM`
-     - Aquila, Aquila2
-     - :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`ArcticForCausalLM`
-     - Arctic
-     - :code:`Snowflake/snowflake-arctic-base`, :code:`Snowflake/snowflake-arctic-instruct`, etc.
-     -
-     - ✅︎
-   * - :code:`BaiChuanForCausalLM`
-     - Baichuan2, Baichuan
-     - :code:`baichuan-inc/Baichuan2-13B-Chat`, :code:`baichuan-inc/Baichuan-7B`, etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`BloomForCausalLM`
-     - BLOOM, BLOOMZ, BLOOMChat
-     - :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
-     -
-     - ✅︎
-   * - :code:`BartForConditionalGeneration`
-     - BART
-     - :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
-     -
-     -
-   * - :code:`ChatGLMModel`
-     - ChatGLM
-     - :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`CohereForCausalLM`, :code:`Cohere2ForCausalLM`
-     - Command-R
-     - :code:`CohereForAI/c4ai-command-r-v01`, :code:`CohereForAI/c4ai-command-r7b-12-2024`, etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`DbrxForCausalLM`
-     - DBRX
-     - :code:`databricks/dbrx-base`, :code:`databricks/dbrx-instruct`, etc.
- - - - ✅︎ - * - :code:`DeciLMForCausalLM` - - DeciLM - - :code:`Deci/DeciLM-7B`, :code:`Deci/DeciLM-7B-instruct`, etc. - - - - ✅︎ - * - :code:`DeepseekForCausalLM` - - DeepSeek - - :code:`deepseek-ai/deepseek-llm-67b-base`, :code:`deepseek-ai/deepseek-llm-7b-chat` etc. - - - - ✅︎ - * - :code:`DeepseekV2ForCausalLM` - - DeepSeek-V2 - - :code:`deepseek-ai/DeepSeek-V2`, :code:`deepseek-ai/DeepSeek-V2-Chat` etc. - - - - ✅︎ - * - :code:`DeepseekV3ForCausalLM` - - DeepSeek-V3 - - :code:`deepseek-ai/DeepSeek-V3-Base`, :code:`deepseek-ai/DeepSeek-V3` etc. - - - - ✅︎ - * - :code:`ExaoneForCausalLM` - - EXAONE-3 - - :code:`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. - - ✅︎ - - ✅︎ - * - :code:`FalconForCausalLM` - - Falcon - - :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc. - - - - ✅︎ - * - :code:`FalconMambaForCausalLM` - - FalconMamba - - :code:`tiiuae/falcon-mamba-7b`, :code:`tiiuae/falcon-mamba-7b-instruct`, etc. - - ✅︎ - - ✅︎ - * - :code:`GemmaForCausalLM` - - Gemma - - :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc. - - ✅︎ - - ✅︎ - * - :code:`Gemma2ForCausalLM` - - Gemma2 - - :code:`google/gemma-2-9b`, :code:`google/gemma-2-27b`, etc. - - ✅︎ - - ✅︎ - * - :code:`GlmForCausalLM` - - GLM-4 - - :code:`THUDM/glm-4-9b-chat-hf`, etc. - - ✅︎ - - ✅︎ - * - :code:`GPT2LMHeadModel` - - GPT-2 - - :code:`gpt2`, :code:`gpt2-xl`, etc. - - - - ✅︎ - * - :code:`GPTBigCodeForCausalLM` - - StarCoder, SantaCoder, WizardCoder - - :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc. - - ✅︎ - - ✅︎ - * - :code:`GPTJForCausalLM` - - GPT-J - - :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc. - - - - ✅︎ - * - :code:`GPTNeoXForCausalLM` - - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM - - :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc. - - - - ✅︎ - * - :code:`GraniteForCausalLM` - - Granite 3.0, Granite 3.1, PowerLM - - :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.1-8b-instruct`, :code:`ibm/PowerLM-3b`, etc. - - ✅︎ - - ✅︎ - * - :code:`GraniteMoeForCausalLM` - - Granite 3.0 MoE, PowerMoE - - :code:`ibm-granite/granite-3.0-1b-a400m-base`, :code:`ibm-granite/granite-3.0-3b-a800m-instruct`, :code:`ibm/PowerMoE-3b`, etc. - - ✅︎ - - ✅︎ - * - :code:`GritLM` - - GritLM - - :code:`parasail-ai/GritLM-7B-vllm`. - - ✅︎ - - ✅︎ - * - :code:`InternLMForCausalLM` - - InternLM - - :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc. - - ✅︎ - - ✅︎ - * - :code:`InternLM2ForCausalLM` - - InternLM2 - - :code:`internlm/internlm2-7b`, :code:`internlm/internlm2-chat-7b`, etc. - - ✅︎ - - ✅︎ - * - :code:`JAISLMHeadModel` - - Jais - - :code:`inceptionai/jais-13b`, :code:`inceptionai/jais-13b-chat`, :code:`inceptionai/jais-30b-v3`, :code:`inceptionai/jais-30b-chat-v3`, etc. - - - - ✅︎ - * - :code:`JambaForCausalLM` - - Jamba - - :code:`ai21labs/AI21-Jamba-1.5-Large`, :code:`ai21labs/AI21-Jamba-1.5-Mini`, :code:`ai21labs/Jamba-v0.1`, etc. - - ✅︎ - - ✅︎ - * - :code:`LlamaForCausalLM` - - Llama 3.1, Llama 3, Llama 2, LLaMA, Yi - - :code:`meta-llama/Meta-Llama-3.1-405B-Instruct`, :code:`meta-llama/Meta-Llama-3.1-70B`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-70b-hf`, :code:`01-ai/Yi-34B`, etc. 
- - ✅︎ - - ✅︎ - * - :code:`MambaForCausalLM` - - Mamba - - :code:`state-spaces/mamba-130m-hf`, :code:`state-spaces/mamba-790m-hf`, :code:`state-spaces/mamba-2.8b-hf`, etc. - - - - ✅︎ - * - :code:`MiniCPMForCausalLM` - - MiniCPM - - :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, :code:`openbmb/MiniCPM-S-1B-sft`, etc. - - ✅︎ - - ✅︎ - * - :code:`MiniCPM3ForCausalLM` - - MiniCPM3 - - :code:`openbmb/MiniCPM3-4B`, etc. - - ✅︎ - - ✅︎ - * - :code:`MistralForCausalLM` - - Mistral, Mistral-Instruct - - :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc. - - ✅︎ - - ✅︎ - * - :code:`MixtralForCausalLM` - - Mixtral-8x7B, Mixtral-8x7B-Instruct - - :code:`mistralai/Mixtral-8x7B-v0.1`, :code:`mistralai/Mixtral-8x7B-Instruct-v0.1`, :code:`mistral-community/Mixtral-8x22B-v0.1`, etc. - - ✅︎ - - ✅︎ - * - :code:`MPTForCausalLM` - - MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter - - :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc. - - - - ✅︎ - * - :code:`NemotronForCausalLM` - - Nemotron-3, Nemotron-4, Minitron - - :code:`nvidia/Minitron-8B-Base`, :code:`mgoin/Nemotron-4-340B-Base-hf-FP8`, etc. - - ✅︎ - - ✅︎ - * - :code:`OLMoForCausalLM` - - OLMo - - :code:`allenai/OLMo-1B-hf`, :code:`allenai/OLMo-7B-hf`, etc. - - - - ✅︎ - * - :code:`OLMo2ForCausalLM` - - OLMo2 - - :code:`allenai/OLMo2-7B-1124`, etc. - - - - ✅︎ - * - :code:`OLMoEForCausalLM` - - OLMoE - - :code:`allenai/OLMoE-1B-7B-0924`, :code:`allenai/OLMoE-1B-7B-0924-Instruct`, etc. - - ✅︎ - - ✅︎ - * - :code:`OPTForCausalLM` - - OPT, OPT-IML - - :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc. - - - - ✅︎ - * - :code:`OrionForCausalLM` - - Orion - - :code:`OrionStarAI/Orion-14B-Base`, :code:`OrionStarAI/Orion-14B-Chat`, etc. - - - - ✅︎ - * - :code:`PhiForCausalLM` - - Phi - - :code:`microsoft/phi-1_5`, :code:`microsoft/phi-2`, etc. - - ✅︎ - - ✅︎ - * - :code:`Phi3ForCausalLM` - - Phi-3 - - :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, :code:`microsoft/Phi-3-medium-128k-instruct`, etc. - - ✅︎ - - ✅︎ - * - :code:`Phi3SmallForCausalLM` - - Phi-3-Small - - :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc. - - - - ✅︎ - * - :code:`PhiMoEForCausalLM` - - Phi-3.5-MoE - - :code:`microsoft/Phi-3.5-MoE-instruct`, etc. - - ✅︎ - - ✅︎ - * - :code:`PersimmonForCausalLM` - - Persimmon - - :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc. - - - - ✅︎ - * - :code:`QWenLMHeadModel` - - Qwen - - :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc. - - ✅︎ - - ✅︎ - * - :code:`Qwen2ForCausalLM` - - Qwen2 - - :code:`Qwen/QwQ-32B-Preview`, :code:`Qwen/Qwen2-7B-Instruct`, :code:`Qwen/Qwen2-7B`, etc. - - ✅︎ - - ✅︎ - * - :code:`Qwen2MoeForCausalLM` - - Qwen2MoE - - :code:`Qwen/Qwen1.5-MoE-A2.7B`, :code:`Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. - - - - ✅︎ - * - :code:`StableLmForCausalLM` - - StableLM - - :code:`stabilityai/stablelm-3b-4e1t`, :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc. - - - - ✅︎ - * - :code:`Starcoder2ForCausalLM` - - Starcoder2 - - :code:`bigcode/starcoder2-3b`, :code:`bigcode/starcoder2-7b`, :code:`bigcode/starcoder2-15b`, etc. - - - - ✅︎ - * - :code:`SolarForCausalLM` - - Solar Pro - - :code:`upstage/solar-pro-preview-instruct`, etc. - - ✅︎ - - ✅︎ - * - :code:`TeleChat2ForCausalLM` - - TeleChat2 - - :code:`TeleAI/TeleChat2-3B`, :code:`TeleAI/TeleChat2-7B`, :code:`TeleAI/TeleChat2-35B`, etc. 
- - ✅︎ - - ✅︎ - * - :code:`XverseForCausalLM` - - XVERSE - - :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc. - - ✅︎ - - ✅︎ +```{list-table} +:widths: 25 25 50 5 5 +:header-rows: 1 + +* - Architecture + - Models + - Example HF Models + - [LoRA](#lora-adapter) + - [PP](#distributed-serving) +* - `AquilaForCausalLM` + - Aquila, Aquila2 + - `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. + - ✅︎ + - ✅︎ +* - `ArcticForCausalLM` + - Arctic + - `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. + - + - ✅︎ +* - `BaiChuanForCausalLM` + - Baichuan2, Baichuan + - `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. + - ✅︎ + - ✅︎ +* - `BloomForCausalLM` + - BLOOM, BLOOMZ, BLOOMChat + - `bigscience/bloom`, `bigscience/bloomz`, etc. + - + - ✅︎ +* - `BartForConditionalGeneration` + - BART + - `facebook/bart-base`, `facebook/bart-large-cnn`, etc. + - + - +* - `ChatGLMModel` + - ChatGLM + - `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc. + - ✅︎ + - ✅︎ +* - `CohereForCausalLM`, `Cohere2ForCausalLM` + - Command-R + - `CohereForAI/c4ai-command-r-v01`, `CohereForAI/c4ai-command-r7b-12-2024`, etc. + - ✅︎ + - ✅︎ +* - `DbrxForCausalLM` + - DBRX + - `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc. + - + - ✅︎ +* - `DeciLMForCausalLM` + - DeciLM + - `Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc. + - + - ✅︎ +* - `DeepseekForCausalLM` + - DeepSeek + - `deepseek-ai/deepseek-llm-67b-base`, `deepseek-ai/deepseek-llm-7b-chat` etc. + - + - ✅︎ +* - `DeepseekV2ForCausalLM` + - DeepSeek-V2 + - `deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat` etc. + - + - ✅︎ +* - `DeepseekV3ForCausalLM` + - DeepSeek-V3 + - `deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3` etc. + - + - ✅︎ +* - `ExaoneForCausalLM` + - EXAONE-3 + - `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. + - ✅︎ + - ✅︎ +* - `FalconForCausalLM` + - Falcon + - `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc. + - + - ✅︎ +* - `FalconMambaForCausalLM` + - FalconMamba + - `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc. + - ✅︎ + - ✅︎ +* - `GemmaForCausalLM` + - Gemma + - `google/gemma-2b`, `google/gemma-7b`, etc. + - ✅︎ + - ✅︎ +* - `Gemma2ForCausalLM` + - Gemma2 + - `google/gemma-2-9b`, `google/gemma-2-27b`, etc. + - ✅︎ + - ✅︎ +* - `GlmForCausalLM` + - GLM-4 + - `THUDM/glm-4-9b-chat-hf`, etc. + - ✅︎ + - ✅︎ +* - `GPT2LMHeadModel` + - GPT-2 + - `gpt2`, `gpt2-xl`, etc. + - + - ✅︎ +* - `GPTBigCodeForCausalLM` + - StarCoder, SantaCoder, WizardCoder + - `bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, `WizardLM/WizardCoder-15B-V1.0`, etc. + - ✅︎ + - ✅︎ +* - `GPTJForCausalLM` + - GPT-J + - `EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc. + - + - ✅︎ +* - `GPTNeoXForCausalLM` + - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM + - `EleutherAI/gpt-neox-20b`, `EleutherAI/pythia-12b`, `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc. + - + - ✅︎ +* - `GraniteForCausalLM` + - Granite 3.0, Granite 3.1, PowerLM + - `ibm-granite/granite-3.0-2b-base`, `ibm-granite/granite-3.1-8b-instruct`, `ibm/PowerLM-3b`, etc. + - ✅︎ + - ✅︎ +* - `GraniteMoeForCausalLM` + - Granite 3.0 MoE, PowerMoE + - `ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc. + - ✅︎ + - ✅︎ +* - `GritLM` + - GritLM + - `parasail-ai/GritLM-7B-vllm`. 
+ - ✅︎ + - ✅︎ +* - `InternLMForCausalLM` + - InternLM + - `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc. + - ✅︎ + - ✅︎ +* - `InternLM2ForCausalLM` + - InternLM2 + - `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc. + - ✅︎ + - ✅︎ +* - `JAISLMHeadModel` + - Jais + - `inceptionai/jais-13b`, `inceptionai/jais-13b-chat`, `inceptionai/jais-30b-v3`, `inceptionai/jais-30b-chat-v3`, etc. + - + - ✅︎ +* - `JambaForCausalLM` + - Jamba + - `ai21labs/AI21-Jamba-1.5-Large`, `ai21labs/AI21-Jamba-1.5-Mini`, `ai21labs/Jamba-v0.1`, etc. + - ✅︎ + - ✅︎ +* - `LlamaForCausalLM` + - Llama 3.1, Llama 3, Llama 2, LLaMA, Yi + - `meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc. + - ✅︎ + - ✅︎ +* - `MambaForCausalLM` + - Mamba + - `state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc. + - + - ✅︎ +* - `MiniCPMForCausalLM` + - MiniCPM + - `openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc. + - ✅︎ + - ✅︎ +* - `MiniCPM3ForCausalLM` + - MiniCPM3 + - `openbmb/MiniCPM3-4B`, etc. + - ✅︎ + - ✅︎ +* - `MistralForCausalLM` + - Mistral, Mistral-Instruct + - `mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc. + - ✅︎ + - ✅︎ +* - `MixtralForCausalLM` + - Mixtral-8x7B, Mixtral-8x7B-Instruct + - `mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc. + - ✅︎ + - ✅︎ +* - `MPTForCausalLM` + - MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter + - `mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`, etc. + - + - ✅︎ +* - `NemotronForCausalLM` + - Nemotron-3, Nemotron-4, Minitron + - `nvidia/Minitron-8B-Base`, `mgoin/Nemotron-4-340B-Base-hf-FP8`, etc. + - ✅︎ + - ✅︎ +* - `OLMoForCausalLM` + - OLMo + - `allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`, etc. + - + - ✅︎ +* - `OLMo2ForCausalLM` + - OLMo2 + - `allenai/OLMo2-7B-1124`, etc. + - + - ✅︎ +* - `OLMoEForCausalLM` + - OLMoE + - `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc. + - ✅︎ + - ✅︎ +* - `OPTForCausalLM` + - OPT, OPT-IML + - `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc. + - + - ✅︎ +* - `OrionForCausalLM` + - Orion + - `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc. + - + - ✅︎ +* - `PhiForCausalLM` + - Phi + - `microsoft/phi-1_5`, `microsoft/phi-2`, etc. + - ✅︎ + - ✅︎ +* - `Phi3ForCausalLM` + - Phi-3 + - `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc. + - ✅︎ + - ✅︎ +* - `Phi3SmallForCausalLM` + - Phi-3-Small + - `microsoft/Phi-3-small-8k-instruct`, `microsoft/Phi-3-small-128k-instruct`, etc. + - + - ✅︎ +* - `PhiMoEForCausalLM` + - Phi-3.5-MoE + - `microsoft/Phi-3.5-MoE-instruct`, etc. + - ✅︎ + - ✅︎ +* - `PersimmonForCausalLM` + - Persimmon + - `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc. + - + - ✅︎ +* - `QWenLMHeadModel` + - Qwen + - `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc. + - ✅︎ + - ✅︎ +* - `Qwen2ForCausalLM` + - Qwen2 + - `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc. + - ✅︎ + - ✅︎ +* - `Qwen2MoeForCausalLM` + - Qwen2MoE + - `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. + - + - ✅︎ +* - `StableLmForCausalLM` + - StableLM + - `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc. 
+  -
+  - ✅︎
+* - `Starcoder2ForCausalLM`
+  - Starcoder2
+  - `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc.
+  -
+  - ✅︎
+* - `SolarForCausalLM`
+  - Solar Pro
+  - `upstage/solar-pro-preview-instruct`, etc.
+  - ✅︎
+  - ✅︎
+* - `TeleChat2ForCausalLM`
+  - TeleChat2
+  - `TeleAI/TeleChat2-3B`, `TeleAI/TeleChat2-7B`, `TeleAI/TeleChat2-35B`, etc.
+  - ✅︎
+  - ✅︎
+* - `XverseForCausalLM`
+  - XVERSE
+  - `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.
+  - ✅︎
+  - ✅︎
 ```
 
 ```{note}
@@ -374,51 +373,50 @@ you should explicitly specify the task type to ensure that the model is used in
 
 #### Text Embedding (`--task embed`)
 
-```{eval-rst}
-.. list-table::
-   :widths: 25 25 50 5 5
-   :header-rows: 1
-
-   * - Architecture
-     - Models
-     - Example HF Models
-     - :ref:`LoRA `
-     - :ref:`PP `
-   * - :code:`BertModel`
-     - BERT-based
-     - :code:`BAAI/bge-base-en-v1.5`, etc.
-     -
-     -
-   * - :code:`Gemma2Model`
-     - Gemma2-based
-     - :code:`BAAI/bge-multilingual-gemma2`, etc.
-     -
-     - ✅︎
-   * - :code:`GritLM`
-     - GritLM
-     - :code:`parasail-ai/GritLM-7B-vllm`.
-     - ✅︎
-     - ✅︎
-   * - :code:`LlamaModel`, :code:`LlamaForCausalLM`, :code:`MistralModel`, etc.
-     - Llama-based
-     - :code:`intfloat/e5-mistral-7b-instruct`, etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`Qwen2Model`, :code:`Qwen2ForCausalLM`
-     - Qwen2-based
-     - :code:`ssmits/Qwen2-7B-Instruct-embed-base` (see note), :code:`Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`RobertaModel`, :code:`RobertaForMaskedLM`
-     - RoBERTa-based
-     - :code:`sentence-transformers/all-roberta-large-v1`, :code:`sentence-transformers/all-roberta-large-v1`, etc.
-     -
-     -
-   * - :code:`XLMRobertaModel`
-     - XLM-RoBERTa-based
-     - :code:`intfloat/multilingual-e5-large`, etc.
-     -
-     -
+```{list-table}
+:widths: 25 25 50 5 5
+:header-rows: 1
+
+* - Architecture
+  - Models
+  - Example HF Models
+  - [LoRA](#lora-adapter)
+  - [PP](#distributed-serving)
+* - `BertModel`
+  - BERT-based
+  - `BAAI/bge-base-en-v1.5`, etc.
+  -
+  -
+* - `Gemma2Model`
+  - Gemma2-based
+  - `BAAI/bge-multilingual-gemma2`, etc.
+  -
+  - ✅︎
+* - `GritLM`
+  - GritLM
+  - `parasail-ai/GritLM-7B-vllm`.
+  - ✅︎
+  - ✅︎
+* - `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc.
+  - Llama-based
+  - `intfloat/e5-mistral-7b-instruct`, etc.
+  - ✅︎
+  - ✅︎
+* - `Qwen2Model`, `Qwen2ForCausalLM`
+  - Qwen2-based
+  - `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc.
+  - ✅︎
+  - ✅︎
+* - `RobertaModel`, `RobertaForMaskedLM`
+  - RoBERTa-based
+  - `sentence-transformers/all-roberta-large-v1`, etc.
+  -
+  -
+* - `XLMRobertaModel`
+  - XLM-RoBERTa-based
+  - `intfloat/multilingual-e5-large`, etc.
+  -
+  -
 ```
 
 ```{note}
@@ -440,31 +438,30 @@ of the whole prompt are extracted from the normalized hidden state corresponding
 
 #### Reward Modeling (`--task reward`)
 
-```{eval-rst}
-.. list-table::
-   :widths: 25 25 50 5 5
-   :header-rows: 1
-
-   * - Architecture
-     - Models
-     - Example HF Models
-     - :ref:`LoRA `
-     - :ref:`PP `
-   * - :code:`InternLM2ForRewardModel`
-     - InternLM2-based
-     - :code:`internlm/internlm2-1_8b-reward`, :code:`internlm/internlm2-7b-reward`, etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`LlamaForCausalLM`
-     - Llama-based
-     - :code:`peiyi9979/math-shepherd-mistral-7b-prm`, etc.
-     - ✅︎
-     - ✅︎
-   * - :code:`Qwen2ForRewardModel`
-     - Qwen2-based
-     - :code:`Qwen/Qwen2.5-Math-RM-72B`, etc.
- - ✅︎ - - ✅︎ +```{list-table} +:widths: 25 25 50 5 5 +:header-rows: 1 + +* - Architecture + - Models + - Example HF Models + - [LoRA](#lora-adapter) + - [PP](#distributed-serving) +* - `InternLM2ForRewardModel` + - InternLM2-based + - `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. + - ✅︎ + - ✅︎ +* - `LlamaForCausalLM` + - Llama-based + - `peiyi9979/math-shepherd-mistral-7b-prm`, etc. + - ✅︎ + - ✅︎ +* - `Qwen2ForRewardModel` + - Qwen2-based + - `Qwen/Qwen2.5-Math-RM-72B`, etc. + - ✅︎ + - ✅︎ ``` If your model is not in the above list, we will try to automatically convert the model using @@ -477,26 +474,25 @@ e.g.: {code}`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 1 #### Classification (`--task classify`) -```{eval-rst} -.. list-table:: - :widths: 25 25 50 5 5 - :header-rows: 1 - - * - Architecture - - Models - - Example HF Models - - :ref:`LoRA ` - - :ref:`PP ` - * - :code:`JambaForSequenceClassification` - - Jamba - - :code:`ai21labs/Jamba-tiny-reward-dev`, etc. - - ✅︎ - - ✅︎ - * - :code:`Qwen2ForSequenceClassification` - - Qwen2-based - - :code:`jason9693/Qwen2.5-1.5B-apeach`, etc. - - ✅︎ - - ✅︎ +```{list-table} +:widths: 25 25 50 5 5 +:header-rows: 1 + +* - Architecture + - Models + - Example HF Models + - [LoRA](#lora-adapter) + - [PP](#distributed-serving) +* - `JambaForSequenceClassification` + - Jamba + - `ai21labs/Jamba-tiny-reward-dev`, etc. + - ✅︎ + - ✅︎ +* - `Qwen2ForSequenceClassification` + - Qwen2-based + - `jason9693/Qwen2.5-1.5B-apeach`, etc. + - ✅︎ + - ✅︎ ``` If your model is not in the above list, we will try to automatically convert the model using @@ -504,31 +500,30 @@ If your model is not in the above list, we will try to automatically convert the #### Sentence Pair Scoring (`--task score`) -```{eval-rst} -.. list-table:: - :widths: 25 25 50 5 5 - :header-rows: 1 - - * - Architecture - - Models - - Example HF Models - - :ref:`LoRA ` - - :ref:`PP ` - * - :code:`BertForSequenceClassification` - - BERT-based - - :code:`cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. - - - - - * - :code:`RobertaForSequenceClassification` - - RoBERTa-based - - :code:`cross-encoder/quora-roberta-base`, etc. - - - - - * - :code:`XLMRobertaForSequenceClassification` - - XLM-RoBERTa-based - - :code:`BAAI/bge-reranker-v2-m3`, etc. - - - - +```{list-table} +:widths: 25 25 50 5 5 +:header-rows: 1 + +* - Architecture + - Models + - Example HF Models + - [LoRA](#lora-adapter) + - [PP](#distributed-serving) +* - `BertForSequenceClassification` + - BERT-based + - `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. + - + - +* - `RobertaForSequenceClassification` + - RoBERTa-based + - `cross-encoder/quora-roberta-base`, etc. + - + - +* - `XLMRobertaForSequenceClassification` + - XLM-RoBERTa-based + - `BAAI/bge-reranker-v2-m3`, etc. + - + - ``` (supported-mm-models)= @@ -558,186 +553,182 @@ See [this page](#generative-models) for more information on how to use generativ #### Text Generation (`--task generate`) -```{eval-rst} -.. list-table:: - :widths: 25 25 15 20 5 5 5 - :header-rows: 1 - - * - Architecture - - Models - - Inputs - - Example HF Models - - :ref:`LoRA ` - - :ref:`PP ` - - V1 - * - :code:`AriaForConditionalGeneration` - - Aria - - T + I - - :code:`rhymes-ai/Aria` - - - - ✅︎ - - - * - :code:`Blip2ForConditionalGeneration` - - BLIP-2 - - T + I\ :sup:`E` - - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc. - - - - ✅︎ - - - * - :code:`ChameleonForConditionalGeneration` - - Chameleon - - T + I - - :code:`facebook/chameleon-7b` etc. 
- - - - ✅︎ - - - * - :code:`FuyuForCausalLM` - - Fuyu - - T + I - - :code:`adept/fuyu-8b` etc. - - - - ✅︎ - - - * - :code:`ChatGLMModel` - - GLM-4V - - T + I - - :code:`THUDM/glm-4v-9b` etc. - - ✅︎ - - ✅︎ - - - * - :code:`H2OVLChatModel` - - H2OVL - - T + I\ :sup:`E+` - - :code:`h2oai/h2ovl-mississippi-800m`, :code:`h2oai/h2ovl-mississippi-2b`, etc. - - - - ✅︎ - - - * - :code:`Idefics3ForConditionalGeneration` - - Idefics3 - - T + I - - :code:`HuggingFaceM4/Idefics3-8B-Llama3` etc. - - ✅︎ - - - - - * - :code:`InternVLChatModel` - - InternVL 2.5, Mono-InternVL, InternVL 2.0 - - T + I\ :sup:`E+` - - :code:`OpenGVLab/InternVL2_5-4B`, :code:`OpenGVLab/Mono-InternVL-2B`, :code:`OpenGVLab/InternVL2-4B`, etc. - - - - ✅︎ - - ✅︎ - * - :code:`LlavaForConditionalGeneration` - - LLaVA-1.5 - - T + I\ :sup:`E+` - - :code:`llava-hf/llava-1.5-7b-hf`, :code:`TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc. - - - - ✅︎ - - ✅︎ - * - :code:`LlavaNextForConditionalGeneration` - - LLaVA-NeXT - - T + I\ :sup:`E+` - - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc. - - - - ✅︎ - - - * - :code:`LlavaNextVideoForConditionalGeneration` - - LLaVA-NeXT-Video - - T + V - - :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc. - - - - ✅︎ - - - * - :code:`LlavaOnevisionForConditionalGeneration` - - LLaVA-Onevision - - T + I\ :sup:`+` + V\ :sup:`+` - - :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc. - - - - ✅︎ - - - * - :code:`MiniCPMV` - - MiniCPM-V - - T + I\ :sup:`E+` - - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc. - - ✅︎ - - ✅︎ - - - * - :code:`MllamaForConditionalGeneration` - - Llama 3.2 - - T + I\ :sup:`+` - - :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc. - - - - - - - * - :code:`MolmoForCausalLM` - - Molmo - - T + I - - :code:`allenai/Molmo-7B-D-0924`, :code:`allenai/Molmo-72B-0924`, etc. - - - - ✅︎ - - ✅︎ - * - :code:`NVLM_D_Model` - - NVLM-D 1.0 - - T + I\ :sup:`E+` - - :code:`nvidia/NVLM-D-72B`, etc. - - - - ✅︎ - - ✅︎ - * - :code:`PaliGemmaForConditionalGeneration` - - PaliGemma, PaliGemma 2 - - T + I\ :sup:`E` - - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, :code:`google/paligemma2-3b-ft-docci-448`, etc. - - - - ✅︎ - - - * - :code:`Phi3VForCausalLM` - - Phi-3-Vision, Phi-3.5-Vision - - T + I\ :sup:`E+` - - :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc. - - - - ✅︎ - - ✅︎ - * - :code:`PixtralForConditionalGeneration` - - Pixtral - - T + I\ :sup:`+` - - :code:`mistralai/Pixtral-12B-2409`, :code:`mistral-community/pixtral-12b` etc. - - - - ✅︎ - - ✅︎ - * - :code:`QWenLMHeadModel` - - Qwen-VL - - T + I\ :sup:`E+` - - :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc. - - ✅︎ - - ✅︎ - - - * - :code:`Qwen2AudioForConditionalGeneration` - - Qwen2-Audio - - T + A\ :sup:`+` - - :code:`Qwen/Qwen2-Audio-7B-Instruct` - - - - ✅︎ - - - * - :code:`Qwen2VLForConditionalGeneration` - - Qwen2-VL - - T + I\ :sup:`E+` + V\ :sup:`E+` - - :code:`Qwen/QVQ-72B-Preview`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc. 
-     - ✅︎
-     - ✅︎
-     -
-   * - :code:`UltravoxModel`
-     - Ultravox
-     - T + A\ :sup:`E+`
-     - :code:`fixie-ai/ultravox-v0_3`
-     -
-     - ✅︎
-     -
+```{list-table}
+:widths: 25 25 15 20 5 5 5
+:header-rows: 1
+
+* - Architecture
+  - Models
+  - Inputs
+  - Example HF Models
+  - [LoRA](#lora-adapter)
+  - [PP](#distributed-serving)
+  - [V1](gh-issue:8779)
+* - `AriaForConditionalGeneration`
+  - Aria
+  - T + I
+  - `rhymes-ai/Aria`
+  -
+  - ✅︎
+  -
+* - `Blip2ForConditionalGeneration`
+  - BLIP-2
+  - T + I<sup>E</sup>
+  - `Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc.
+  -
+  - ✅︎
+  -
+* - `ChameleonForConditionalGeneration`
+  - Chameleon
+  - T + I
+  - `facebook/chameleon-7b` etc.
+  -
+  - ✅︎
+  -
+* - `FuyuForCausalLM`
+  - Fuyu
+  - T + I
+  - `adept/fuyu-8b` etc.
+  -
+  - ✅︎
+  -
+* - `ChatGLMModel`
+  - GLM-4V
+  - T + I
+  - `THUDM/glm-4v-9b` etc.
+  - ✅︎
+  - ✅︎
+  -
+* - `H2OVLChatModel`
+  - H2OVL
+  - T + I<sup>E+</sup>
+  - `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
+  -
+  - ✅︎
+  -
+* - `Idefics3ForConditionalGeneration`
+  - Idefics3
+  - T + I
+  - `HuggingFaceM4/Idefics3-8B-Llama3` etc.
+  - ✅︎
+  -
+  -
+* - `InternVLChatModel`
+  - InternVL 2.5, Mono-InternVL, InternVL 2.0
+  - T + I<sup>E+</sup>
+  - `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
+  -
+  - ✅︎
+  - ✅︎
+* - `LlavaForConditionalGeneration`
+  - LLaVA-1.5
+  - T + I<sup>E+</sup>
+  - `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
+  -
+  - ✅︎
+  - ✅︎
+* - `LlavaNextForConditionalGeneration`
+  - LLaVA-NeXT
+  - T + I<sup>E+</sup>
+  - `llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
+  -
+  - ✅︎
+  -
+* - `LlavaNextVideoForConditionalGeneration`
+  - LLaVA-NeXT-Video
+  - T + V
+  - `llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
+  -
+  - ✅︎
+  -
+* - `LlavaOnevisionForConditionalGeneration`
+  - LLaVA-Onevision
+  - T + I<sup>+</sup> + V<sup>+</sup>
+  - `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
+  -
+  - ✅︎
+  -
+* - `MiniCPMV`
+  - MiniCPM-V
+  - T + I<sup>E+</sup>
+  - `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
+  - ✅︎
+  - ✅︎
+  -
+* - `MllamaForConditionalGeneration`
+  - Llama 3.2
+  - T + I<sup>+</sup>
+  - `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc.
+  -
+  -
+  -
+* - `MolmoForCausalLM`
+  - Molmo
+  - T + I
+  - `allenai/Molmo-7B-D-0924`, `allenai/Molmo-72B-0924`, etc.
+  -
+  - ✅︎
+  - ✅︎
+* - `NVLM_D_Model`
+  - NVLM-D 1.0
+  - T + I<sup>E+</sup>
+  - `nvidia/NVLM-D-72B`, etc.
+  -
+  - ✅︎
+  - ✅︎
+* - `PaliGemmaForConditionalGeneration`
+  - PaliGemma, PaliGemma 2
+  - T + I<sup>E</sup>
+  - `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
+  -
+  - ✅︎
+  -
+* - `Phi3VForCausalLM`
+  - Phi-3-Vision, Phi-3.5-Vision
+  - T + I<sup>E+</sup>
+  - `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct` etc.
+  -
+  - ✅︎
+  - ✅︎
+* - `PixtralForConditionalGeneration`
+  - Pixtral
+  - T + I<sup>+</sup>
+  - `mistralai/Pixtral-12B-2409`, `mistral-community/pixtral-12b` etc.
+  -
+  - ✅︎
+  - ✅︎
+* - `QWenLMHeadModel`
+  - Qwen-VL
+  - T + I<sup>E+</sup>
+  - `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
+  - ✅︎
+  - ✅︎
+  -
+* - `Qwen2AudioForConditionalGeneration`
+  - Qwen2-Audio
+  - T + A<sup>+</sup>
+  - `Qwen/Qwen2-Audio-7B-Instruct`
+  -
+  - ✅︎
+  -
+* - `Qwen2VLForConditionalGeneration`
+  - Qwen2-VL
+  - T + I<sup>E+</sup> + V<sup>E+</sup>
+  - `Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc.
+  - ✅︎
+  - ✅︎
+  -
+* - `UltravoxModel`
+  - Ultravox
+  - T + A<sup>E+</sup>
+  - `fixie-ai/ultravox-v0_3`
+  -
+  - ✅︎
+  -
 ```
 
-```{eval-rst}
-:sup:`E` Pre-computed embeddings can be inputted for this modality.
-
-:sup:`+` Multiple items can be inputted per text prompt for this modality.
-```
+<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
+<sup>+</sup> Multiple items can be inputted per text prompt for this modality.
 
 ````{important}
 To enable multiple multi-modal items per text prompt, you have to set {code}`limit_mm_per_prompt` (offline inference)
@@ -787,38 +778,37 @@ To get the best results, you should use pooling models that are specifically tra
 
 The following table lists those that are tested in vLLM.
 
-```{eval-rst}
-.. list-table::
-   :widths: 25 25 15 25 5 5
-   :header-rows: 1
-
-   * - Architecture
-     - Models
-     - Inputs
-     - Example HF Models
-     - :ref:`LoRA `
-     - :ref:`PP `
-   * - :code:`LlavaNextForConditionalGeneration`
-     - LLaVA-NeXT-based
-     - T / I
-     - :code:`royokong/e5-v`
-     -
-     - ✅︎
-   * - :code:`Phi3VForCausalLM`
-     - Phi-3-Vision-based
-     - T + I
-     - :code:`TIGER-Lab/VLM2Vec-Full`
-     - 🚧
-     - ✅︎
-   * - :code:`Qwen2VLForConditionalGeneration`
-     - Qwen2-VL-based
-     - T + I
-     - :code:`MrLight/dse-qwen2-2b-mrl-v1`
-     -
-     - ✅︎
+```{list-table}
+:widths: 25 25 15 25 5 5
+:header-rows: 1
+
+* - Architecture
+  - Models
+  - Inputs
+  - Example HF Models
+  - [LoRA](#lora-adapter)
+  - [PP](#distributed-serving)
+* - `LlavaNextForConditionalGeneration`
+  - LLaVA-NeXT-based
+  - T / I
+  - `royokong/e5-v`
+  -
+  - ✅︎
+* - `Phi3VForCausalLM`
+  - Phi-3-Vision-based
+  - T + I
+  - `TIGER-Lab/VLM2Vec-Full`
+  - 🚧
+  - ✅︎
+* - `Qwen2VLForConditionalGeneration`
+  - Qwen2-VL-based
+  - T + I
+  - `MrLight/dse-qwen2-2b-mrl-v1`
+  -
+  - ✅︎
 ```
 
-______________________________________________________________________
+_________________
 
 # Model Support Policy
 
diff --git a/docs/source/quantization/supported_hardware.md b/docs/source/quantization/supported_hardware.md
index 843ee21627d78..7330c2f8aa194 100644
--- a/docs/source/quantization/supported_hardware.md
+++ b/docs/source/quantization/supported_hardware.md
@@ -4,121 +4,120 @@
 
 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
 
-```{eval-rst}
-..
list-table:: - :header-rows: 1 - :widths: 20 8 8 8 8 8 8 8 8 8 8 +```{list-table} +:header-rows: 1 +:widths: 20 8 8 8 8 8 8 8 8 8 8 - * - Implementation - - Volta - - Turing - - Ampere - - Ada - - Hopper - - AMD GPU - - Intel GPU - - x86 CPU - - AWS Inferentia - - Google TPU - * - AWQ - - ✗ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - * - GPTQ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - * - Marlin (GPTQ/AWQ/FP8) - - ✗ - - ✗ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - - ✗ - - ✗ - - ✗ - * - INT8 (W8A8) - - ✗ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - - ✅︎ - - ✗ - - ✗ - * - FP8 (W8A8) - - ✗ - - ✗ - - ✗ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - - ✗ - - ✗ - * - AQLM - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - - ✗ - - ✗ - - ✗ - * - bitsandbytes - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - - ✗ - - ✗ - - ✗ - * - DeepSpeedFP - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - - ✗ - - ✗ - - ✗ - * - GGUF - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✅︎ - - ✗ - - ✗ - - ✗ - - ✗ - - ✗ +* - Implementation + - Volta + - Turing + - Ampere + - Ada + - Hopper + - AMD GPU + - Intel GPU + - x86 CPU + - AWS Inferentia + - Google TPU +* - AWQ + - ✗ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✅︎ + - ✅︎ + - ✗ + - ✗ +* - GPTQ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✅︎ + - ✅︎ + - ✗ + - ✗ +* - Marlin (GPTQ/AWQ/FP8) + - ✗ + - ✗ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✗ + - ✗ + - ✗ + - ✗ +* - INT8 (W8A8) + - ✗ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✗ + - ✅︎ + - ✗ + - ✗ +* - FP8 (W8A8) + - ✗ + - ✗ + - ✗ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✗ + - ✗ + - ✗ +* - AQLM + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✗ + - ✗ + - ✗ + - ✗ +* - bitsandbytes + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✗ + - ✗ + - ✗ + - ✗ +* - DeepSpeedFP + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✗ + - ✗ + - ✗ + - ✗ +* - GGUF + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✅︎ + - ✗ + - ✗ + - ✗ + - ✗ + - ✗ ``` ## Notes: diff --git a/docs/source/serving/deploying_with_helm.md b/docs/source/serving/deploying_with_helm.md index 3b26575827011..7286a0a88968f 100644 --- a/docs/source/serving/deploying_with_helm.md +++ b/docs/source/serving/deploying_with_helm.md @@ -43,209 +43,208 @@ chart **including persistent volumes** and deletes the release. ## Values -```{eval-rst} -.. 
list-table:: Values - :widths: 25 25 25 25 - :header-rows: 1 - - * - Key - - Type - - Default - - Description - * - autoscaling - - object - - {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} - - Autoscaling configuration - * - autoscaling.enabled - - bool - - false - - Enable autoscaling - * - autoscaling.maxReplicas - - int - - 100 - - Maximum replicas - * - autoscaling.minReplicas - - int - - 1 - - Minimum replicas - * - autoscaling.targetCPUUtilizationPercentage - - int - - 80 - - Target CPU utilization for autoscaling - * - configs - - object - - {} - - Configmap - * - containerPort - - int - - 8000 - - Container port - * - customObjects - - list - - [] - - Custom Objects configuration - * - deploymentStrategy - - object - - {} - - Deployment strategy configuration - * - externalConfigs - - list - - [] - - External configuration - * - extraContainers - - list - - [] - - Additional containers configuration - * - extraInit - - object - - {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true} - - Additional configuration for the init container - * - extraInit.pvcStorage - - string - - "50Gi" - - Storage size of the s3 - * - extraInit.s3modelpath - - string - - "relative_s3_model_path/opt-125m" - - Path of the model on the s3 which hosts model weights and config files - * - extraInit.awsEc2MetadataDisabled - - boolean - - true - - Disables the use of the Amazon EC2 instance metadata service - * - extraPorts - - list - - [] - - Additional ports configuration - * - gpuModels - - list - - ["TYPE_GPU_USED"] - - Type of gpu used - * - image - - object - - {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} - - Image configuration - * - image.command - - list - - ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] - - Container launch command - * - image.repository - - string - - "vllm/vllm-openai" - - Image repository - * - image.tag - - string - - "latest" - - Image tag - * - livenessProbe - - object - - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} - - Liveness probe configuration - * - livenessProbe.failureThreshold - - int - - 3 - - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive - * - livenessProbe.httpGet - - object - - {"path":"/health","port":8000} - - Configuration of the Kubelet http request on the server - * - livenessProbe.httpGet.path - - string - - "/health" - - Path to access on the HTTP server - * - livenessProbe.httpGet.port - - int - - 8000 - - Name or number of the port to access on the container, on which the server is listening - * - livenessProbe.initialDelaySeconds - - int - - 15 - - Number of seconds after the container has started before liveness probe is initiated - * - livenessProbe.periodSeconds - - int - - 10 - - How often (in seconds) to perform the liveness probe - * - maxUnavailablePodDisruptionBudget - - string - - "" - - Disruption Budget Configuration - * - readinessProbe - - object - - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} - - Readiness probe configuration - * - readinessProbe.failureThreshold - - int - - 3 - - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has 
failed: the container is not ready - * - readinessProbe.httpGet - - object - - {"path":"/health","port":8000} - - Configuration of the Kubelet http request on the server - * - readinessProbe.httpGet.path - - string - - "/health" - - Path to access on the HTTP server - * - readinessProbe.httpGet.port - - int - - 8000 - - Name or number of the port to access on the container, on which the server is listening - * - readinessProbe.initialDelaySeconds - - int - - 5 - - Number of seconds after the container has started before readiness probe is initiated - * - readinessProbe.periodSeconds - - int - - 5 - - How often (in seconds) to perform the readiness probe - * - replicaCount - - int - - 1 - - Number of replicas - * - resources - - object - - {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} - - Resource configuration - * - resources.limits."nvidia.com/gpu" - - int - - 1 - - Number of gpus used - * - resources.limits.cpu - - int - - 4 - - Number of CPUs - * - resources.limits.memory - - string - - "16Gi" - - CPU memory configuration - * - resources.requests."nvidia.com/gpu" - - int - - 1 - - Number of gpus used - * - resources.requests.cpu - - int - - 4 - - Number of CPUs - * - resources.requests.memory - - string - - "16Gi" - - CPU memory configuration - * - secrets - - object - - {} - - Secrets configuration - * - serviceName - - string - - - - Service name - * - servicePort - - int - - 80 - - Service port - * - labels.environment - - string - - test - - Environment name - * - labels.release - - string - - test - - Release name +```{list-table} +:widths: 25 25 25 25 +:header-rows: 1 + +* - Key + - Type + - Default + - Description +* - autoscaling + - object + - {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} + - Autoscaling configuration +* - autoscaling.enabled + - bool + - false + - Enable autoscaling +* - autoscaling.maxReplicas + - int + - 100 + - Maximum replicas +* - autoscaling.minReplicas + - int + - 1 + - Minimum replicas +* - autoscaling.targetCPUUtilizationPercentage + - int + - 80 + - Target CPU utilization for autoscaling +* - configs + - object + - {} + - Configmap +* - containerPort + - int + - 8000 + - Container port +* - customObjects + - list + - [] + - Custom Objects configuration +* - deploymentStrategy + - object + - {} + - Deployment strategy configuration +* - externalConfigs + - list + - [] + - External configuration +* - extraContainers + - list + - [] + - Additional containers configuration +* - extraInit + - object + - {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true} + - Additional configuration for the init container +* - extraInit.pvcStorage + - string + - "50Gi" + - Storage size of the s3 +* - extraInit.s3modelpath + - string + - "relative_s3_model_path/opt-125m" + - Path of the model on the s3 which hosts model weights and config files +* - extraInit.awsEc2MetadataDisabled + - boolean + - true + - Disables the use of the Amazon EC2 instance metadata service +* - extraPorts + - list + - [] + - Additional ports configuration +* - gpuModels + - list + - ["TYPE_GPU_USED"] + - Type of gpu used +* - image + - object + - {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} + - Image configuration +* - image.command + - list + - 
["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] + - Container launch command +* - image.repository + - string + - "vllm/vllm-openai" + - Image repository +* - image.tag + - string + - "latest" + - Image tag +* - livenessProbe + - object + - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} + - Liveness probe configuration +* - livenessProbe.failureThreshold + - int + - 3 + - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive +* - livenessProbe.httpGet + - object + - {"path":"/health","port":8000} + - Configuration of the Kubelet http request on the server +* - livenessProbe.httpGet.path + - string + - "/health" + - Path to access on the HTTP server +* - livenessProbe.httpGet.port + - int + - 8000 + - Name or number of the port to access on the container, on which the server is listening +* - livenessProbe.initialDelaySeconds + - int + - 15 + - Number of seconds after the container has started before liveness probe is initiated +* - livenessProbe.periodSeconds + - int + - 10 + - How often (in seconds) to perform the liveness probe +* - maxUnavailablePodDisruptionBudget + - string + - "" + - Disruption Budget Configuration +* - readinessProbe + - object + - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} + - Readiness probe configuration +* - readinessProbe.failureThreshold + - int + - 3 + - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready +* - readinessProbe.httpGet + - object + - {"path":"/health","port":8000} + - Configuration of the Kubelet http request on the server +* - readinessProbe.httpGet.path + - string + - "/health" + - Path to access on the HTTP server +* - readinessProbe.httpGet.port + - int + - 8000 + - Name or number of the port to access on the container, on which the server is listening +* - readinessProbe.initialDelaySeconds + - int + - 5 + - Number of seconds after the container has started before readiness probe is initiated +* - readinessProbe.periodSeconds + - int + - 5 + - How often (in seconds) to perform the readiness probe +* - replicaCount + - int + - 1 + - Number of replicas +* - resources + - object + - {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} + - Resource configuration +* - resources.limits."nvidia.com/gpu" + - int + - 1 + - Number of gpus used +* - resources.limits.cpu + - int + - 4 + - Number of CPUs +* - resources.limits.memory + - string + - "16Gi" + - CPU memory configuration +* - resources.requests."nvidia.com/gpu" + - int + - 1 + - Number of gpus used +* - resources.requests.cpu + - int + - 4 + - Number of CPUs +* - resources.requests.memory + - string + - "16Gi" + - CPU memory configuration +* - secrets + - object + - {} + - Secrets configuration +* - serviceName + - string + - + - Service name +* - servicePort + - int + - 80 + - Service port +* - labels.environment + - string + - test + - Environment name +* - labels.release + - string + - test + - Release name ```
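For reference, the keys in the Values table above map one-to-one onto a Helm override file. The following is a minimal sketch, not part of the chart itself: the file name `values-custom.yaml` is a placeholder, and the concrete values shown are simply the defaults documented in the table.

```yaml
# values-custom.yaml -- hypothetical override file. Every key below comes
# from the Values table above; the values are the documented defaults.
image:
  repository: "vllm/vllm-openai"   # Image repository
  tag: "latest"                    # Image tag
  # Container launch command (image.command)
  command: ["vllm", "serve", "/data/", "--served-model-name", "opt-125m", "--host", "0.0.0.0", "--port", "8000"]
containerPort: 8000                # Container port
replicaCount: 1                    # Number of replicas
resources:                         # Resource configuration
  requests:
    cpu: 4
    memory: "16Gi"
    nvidia.com/gpu: 1
  limits:
    cpu: 4
    memory: "16Gi"
    nvidia.com/gpu: 1
extraInit:
  pvcStorage: "50Gi"                             # Storage size for the model PVC
  s3modelpath: "relative_s3_model_path/opt-125m" # Model path on S3
  awsEc2MetadataDisabled: true
autoscaling:
  enabled: false                   # Enable autoscaling
```

Such a file would be applied with something like `helm upgrade --install <release-name> <chart-path> -f values-custom.yaml`, where the release name and chart path depend on your deployment.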