[Doc][3/N] Reorganize Serving section #11766

Merged: 12 commits, Jan 7, 2025
README.md (1 addition, 1 deletion)
@@ -77,7 +77,7 @@ pip install vllm
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
- [List of Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

## Contributing

docs/source/contributing/dockerfile/dockerfile.md (1 addition, 1 deletion)
@@ -1,7 +1,7 @@
# Dockerfile

We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
More information about deploying with Docker can be found [here](#deployment-docker).

Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

docs/source/contributing/model/registration.md (2 additions, 2 deletions)
@@ -3,7 +3,7 @@
# Model Registration

vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found on the [Supported Models](#supported-models) page.
A list of pre-registered architectures can be found [here](#supported-models).

If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so.
@@ -16,7 +16,7 @@ This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](#new-model-basic)), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
You should also include an example HuggingFace repository for this model in <gh-file:tests/models/registry.py> to run the unit tests.
Finally, update the [Supported Models](#supported-models) documentation page to promote your model!
Finally, update our [list of supported models](#supported-models) to promote your model!
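
For illustration only, a registry entry maps an architecture name (as reported in the model's HuggingFace config) to the module and class that implement it; the exact layout of `_VLLM_MODELS` can change between versions, so check <gh-file:vllm/model_executor/models/registry.py> before editing it. The names below (`MyModelForCausalLM`, `my_model`) are hypothetical:

```python
# Sketch of a registry entry; the real dictionary in
# vllm/model_executor/models/registry.py may be structured differently.
_VLLM_MODELS = {
    # "ArchitectureName": (module under vllm/model_executor/models/, class name)
    "MyModelForCausalLM": ("my_model", "MyModelForCausalLM"),
}
```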

```{important}
The list of models in each section should be maintained in alphabetical order.
@@ -1,6 +1,6 @@
(deploying-with-docker)=
(deployment-docker)=

# Deploying with Docker
# Using Docker

## Use vLLM's Official Docker Image

@@ -1,6 +1,6 @@
(deploying-with-bentoml)=
(deployment-bentoml)=

# Deploying with BentoML
# BentoML

[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

@@ -1,6 +1,6 @@
(deploying-with-cerebrium)=
(deployment-cerebrium)=

# Deploying with Cerebrium
# Cerebrium

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-dstack)=
(deployment-dstack)=

# Deploying with dstack
# dstack

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-helm)=
(deployment-helm)=

# Deploying with Helm
# Helm

A Helm chart to deploy vLLM for Kubernetes

@@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release.

## Architecture

```{image} architecture_helm_deployment.png
```{image} /assets/deployment/architecture_helm_deployment.png
```

## Values
docs/source/deployment/frameworks/index.md (13 additions, 0 deletions)
@@ -0,0 +1,13 @@
# Using other frameworks

```{toctree}
:maxdepth: 1

bentoml
cerebrium
dstack
helm
lws
skypilot
triton
```
@@ -1,6 +1,6 @@
(deploying-with-lws)=
(deployment-lws)=

# Deploying with LWS
# LWS

LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
@@ -1,6 +1,6 @@
(on-cloud)=
(deployment-skypilot)=

# Deploying and scaling up with SkyPilot
# SkyPilot

```{raw} html
<p align="center">
@@ -12,9 +12,9 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet

## Prerequisites

- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model {code}`meta-llama/Meta-Llama-3-8B-Instruct`.
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that {code}`sky check` shows clouds or Kubernetes are enabled.
- Check that `sky check` shows clouds or Kubernetes are enabled.

```console
pip install skypilot-nightly
@@ -1,5 +1,5 @@
(deploying-with-triton)=
(deployment-triton)=

# Deploying with NVIDIA Triton
# NVIDIA Triton

The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
docs/source/deployment/integrations/index.md (9 additions, 0 deletions)
@@ -0,0 +1,9 @@
# External Integrations

```{toctree}
:maxdepth: 1

kserve
kubeai
llamastack
```
@@ -1,6 +1,6 @@
(deploying-with-kserve)=
(deployment-kserve)=

# Deploying with KServe
# KServe

vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

@@ -1,6 +1,6 @@
(deploying-with-kubeai)=
(deployment-kubeai)=

# Deploying with KubeAI
# KubeAI

[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

@@ -1,6 +1,6 @@
(run-on-llamastack)=
(deployment-llamastack)=

# Serving with Llama Stack
# Llama Stack

vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

@@ -1,6 +1,6 @@
(deploying-with-k8s)=
(deployment-k8s)=

# Deploying with Kubernetes
# Using Kubernetes

Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

@@ -1,6 +1,6 @@
(nginxloadbalancer)=

# Deploying with Nginx Loadbalancer
# Using Nginx

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

docs/source/design/arch_overview.md (1 addition, 1 deletion)
@@ -57,7 +57,7 @@ More API details can be found in the {doc}`Offline Inference

The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.

### OpenAI-compatible API server
### OpenAI-Compatible API Server

The second primary interface to vLLM is via its OpenAI-compatible API server.
This server can be started using the `vllm serve` command.
docs/source/features/disagg_prefill.md (6 additions, 2 deletions)
@@ -1,8 +1,12 @@
(disagg-prefill)=

# Disaggregated prefilling (experimental)
# Disaggregated Prefilling (experimental)

This page introduces you the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change.
This page introduces you to the disaggregated prefilling feature in vLLM.

```{note}
This feature is experimental and subject to change.
```

## Why disaggregated prefilling?

docs/source/features/spec_decode.md (1 addition, 1 deletion)
@@ -1,6 +1,6 @@
(spec-decode)=

# Speculative decoding
# Speculative Decoding

```{warning}
Please note that speculative decoding in vLLM is not yet optimized and does
docs/source/getting_started/installation/gpu-rocm.md (1 addition, 1 deletion)
@@ -148,7 +148,7 @@ $ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
$ python3 setup.py develop
```

This may take 5-10 minutes. Currently, {code}`pip install .` does not work for ROCm installation.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

```{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
docs/source/getting_started/installation/hpu-gaudi.md (1 addition, 1 deletion)
@@ -82,7 +82,7 @@ $ python setup.py develop

## Supported Features

- [Offline batched inference](#offline-batched-inference)
- [Offline inference](#offline-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
docs/source/getting_started/quickstart.md (10 additions, 8 deletions)
@@ -2,30 +2,32 @@

# Quickstart

This guide will help you quickly get started with vLLM to:
This guide will help you quickly get started with vLLM to perform:

- [Run offline batched inference](#offline-batched-inference)
- [Run OpenAI-compatible inference](#openai-compatible-server)
- [Offline batched inference](#quickstart-offline)
- [Online inference using OpenAI-compatible server](#quickstart-online)

## Prerequisites

- OS: Linux
- Python: 3.9 -- 3.12
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

## Installation

You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.

```console
$ conda create -n myenv python=3.10 -y
$ conda activate myenv
$ pip install vllm
```

Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
```{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
```

(offline-batched-inference)=
(quickstart-offline)=

## Offline Batched Inference

@@ -73,7 +75,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
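
For reference (the diff only shows the tail of the example), the full pattern in this section looks roughly like the following sketch; the model name `facebook/opt-125m` is just a small illustrative choice and any supported model works:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Sampling settings are illustrative; adjust them for your use case.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Any model from the supported models list can be used here.
llm = LLM(model="facebook/opt-125m")

# generate() runs the whole batch of prompts through the engine.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```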

(openai-compatible-server)=
(quickstart-online)=

## OpenAI-Compatible Server
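
The body of this section is collapsed in the diff, but the general workflow is: start the server with `vllm serve <model>`, then query it with any OpenAI client library. A minimal sketch, assuming the server runs on the default port 8000 and was started with an illustrative model:

```python
# Assumes a server was started in another terminal, for example:
#   vllm serve facebook/opt-125m
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default vLLM server address
    api_key="EMPTY",  # no real key is required unless the server sets one
)

completion = client.completions.create(
    model="facebook/opt-125m",  # must match the model the server was started with
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```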

docs/source/index.md (28 additions, 21 deletions)
@@ -65,32 +65,14 @@ getting_started/troubleshooting
getting_started/faq
```

```{toctree}
:caption: Serving
:maxdepth: 1

serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
serving/engine_args
serving/env_vars
serving/usage_stats
```

```{toctree}
:caption: Models
:maxdepth: 1

models/supported_models
models/generative_models
models/pooling_models
models/supported_models
models/extensions/index
```

```{toctree}
@@ -99,7 +81,6 @@

features/quantization/index
features/lora
features/multimodal_inputs
features/tool_calling
features/structured_outputs
features/automatic_prefix_caching
@@ -108,6 +89,32 @@ features/spec_decode
features/compatibility_matrix
```

```{toctree}
:caption: Inference and Serving
:maxdepth: 1

serving/offline_inference
serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
```

```{toctree}
:caption: Deployment
:maxdepth: 1

deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
```

```{toctree}
:caption: Performance
:maxdepth: 1
docs/source/models/extensions/index.md (8 additions, 0 deletions)
@@ -0,0 +1,8 @@
# Built-in Extensions

```{toctree}
:maxdepth: 1

runai_model_streamer
tensorizer
```
@@ -1,6 +1,6 @@
(runai-model-streamer)=

# Loading Models with Run:ai Model Streamer
# Loading models with Run:ai Model Streamer

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
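
The usage details are collapsed in this diff; as a rough sketch, the streamer is selected through the engine's load format. The `load_format` value below is an assumption, so verify it against the full page before relying on it:

```python
from vllm import LLM

# Assumption: "runai_streamer" is the load_format value that enables the
# Run:ai Model Streamer; check the full documentation page for the exact name.
llm = LLM(
    model="facebook/opt-125m",  # illustrative model
    load_format="runai_streamer",
)
```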
@@ -1,6 +1,6 @@
(tensorizer)=

# Loading Models with CoreWeave's Tensorizer
# Loading models with CoreWeave's Tensorizer

vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized