Reorganize Serving section
Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 committed Jan 6, 2025
1 parent 996357e commit 7923116
Showing 31 changed files with 191 additions and 80 deletions.
2 changes: 1 addition & 1 deletion docs/source/contributing/dockerfile/dockerfile.md
@@ -1,7 +1,7 @@
# Dockerfile

We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
More information about deploying with Docker can be found [here](#deployment-docker).

Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

@@ -1,6 +1,6 @@
(deploying-with-docker)=
(deployment-docker)=

# Deploying with Docker
# Using Docker

## Use vLLM's Official Docker Image

@@ -1,6 +1,6 @@
(deploying-with-bentoml)=
(deployment-bentoml)=

# Deploying with BentoML
# BentoML

[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

@@ -1,6 +1,6 @@
(deploying-with-cerebrium)=
(deployment-cerebrium)=

# Deploying with Cerebrium
# Cerebrium

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-dstack)=
(deployment-dstack)=

# Deploying with dstack
# dstack

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-helm)=
(deployment-helm)=

# Deploying with Helm
# Helm

A Helm chart to deploy vLLM for Kubernetes

@@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release.

## Architecture

```{image} architecture_helm_deployment.png
```{image} /assets/deployment/architecture_helm_deployment.png
```

## Values
13 changes: 13 additions & 0 deletions docs/source/deployment/frameworks/index.md
@@ -0,0 +1,13 @@
# Using other frameworks

```{toctree}
:maxdepth: 1
bentoml
cerebrium
dstack
helm
lws
skypilot
triton
```
@@ -1,6 +1,6 @@
(deploying-with-lws)=
(deployment-lws)=

# Deploying with LWS
# LWS

LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
@@ -1,6 +1,6 @@
(on-cloud)=
(deployment-skypilot)=

# Deploying and scaling up with SkyPilot
# SkyPilot

```{raw} html
<p align="center">
@@ -1,5 +1,5 @@
(deploying-with-triton)=
(deployment-triton)=

# Deploying with NVIDIA Triton
# NVIDIA Triton

The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
9 changes: 9 additions & 0 deletions docs/source/deployment/integrations/index.md
@@ -0,0 +1,9 @@
# External integrations

```{toctree}
:maxdepth: 1
kserve
kubeai
llamastack
```
@@ -1,6 +1,6 @@
(deploying-with-kserve)=
(deployment-kserve)=

# Deploying with KServe
# KServe

vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

@@ -1,6 +1,6 @@
(deploying-with-kubeai)=
(deployment-kubeai)=

# Deploying with KubeAI
# KubeAI

[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

@@ -1,6 +1,6 @@
(run-on-llamastack)=
(deployment-llamastack)=

# Serving with Llama Stack
# Llama Stack

vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

@@ -1,6 +1,6 @@
(deploying-with-k8s)=
(deployment-k8s)=

# Deploying with Kubernetes
# Using Kubernetes

Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

@@ -43,7 +43,7 @@ metadata:
name: hf-token-secret
namespace: default
type: Opaque
stringData:
data:
token: "REPLACE_WITH_TOKEN"
```
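
Note that, unlike `stringData`, the `data` field of a Kubernetes Secret expects base64-encoded values, so the token placeholder above needs to be replaced with an encoded string. A quick sketch of producing that value in Python:

```python
import base64

# `data` values in a Kubernetes Secret must be base64-encoded
# (whereas `stringData` accepts plain strings).
hf_token = "REPLACE_WITH_TOKEN"  # placeholder for your actual Hugging Face token
print(base64.b64encode(hf_token.encode("utf-8")).decode("ascii"))
```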
@@ -1,6 +1,6 @@
(nginxloadbalancer)=

# Deploying with Nginx Loadbalancer
# Using Nginx

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

2 changes: 1 addition & 1 deletion docs/source/getting_started/installation/hpu-gaudi.md
@@ -82,7 +82,7 @@ $ python setup.py develop

## Supported Features

- [Offline batched inference](#offline-batched-inference)
- [Offline inference](#offline-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
22 changes: 12 additions & 10 deletions docs/source/getting_started/quickstart.md
@@ -2,32 +2,34 @@

# Quickstart

This guide will help you quickly get started with vLLM to:
This guide will help you quickly get started with vLLM to perform:

- [Run offline batched inference](#offline-batched-inference)
- [Run OpenAI-compatible inference](#openai-compatible-server)
- [Offline batched inference](#quickstart-offline)
- [Online inference using OpenAI-compatible server](#quickstart-online)

## Prerequisites

- OS: Linux
- Python: 3.9 -- 3.12
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

## Installation

You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.

```console
$ conda create -n myenv python=3.10 -y
$ conda activate myenv
$ pip install vllm
```

Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
```{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
```

(offline-batched-inference)=
(quickstart-offline)=

## Offline Batched Inference
## Offline batched inference

With vLLM installed, you can start generating texts for a list of input prompts (i.e., offline batch inference). See the example script: <gh-file:examples/offline_inference.py>

@@ -73,9 +75,9 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
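
For reference, a minimal end-to-end sketch of the offline batched inference API shown above (the prompts and model name are only examples):

```python
from vllm import LLM, SamplingParams

# A small batch of prompts processed in a single offline run.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Any Hugging Face model ID works here; facebook/opt-125m is just a small example.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```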

(openai-compatible-server)=
(quickstart-online)=

## OpenAI-Compatible Server
## OpenAI-compatible server

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments. The server currently hosts one model at a time and implements endpoints such as [list models](https://platform.openai.com/docs/api-reference/models/list), [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create), and [create completion](https://platform.openai.com/docs/api-reference/completions/create) endpoints.
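
As a rough sketch of how a client might exercise these endpoints with the official `openai` Python package (the base URL assumes the default host and port, and the API key is a dummy value unless the server is configured to require one):

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the model(s) currently hosted by the server.
models = client.models.list()
model_name = models.data[0].id

# Create a chat completion against the hosted model.
chat = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)
```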
25 changes: 16 additions & 9 deletions docs/source/index.md
@@ -66,19 +66,26 @@ getting_started/faq
```

```{toctree}
:caption: Serving
:caption: Inference and Serving
:maxdepth: 1
serving/offline_inference
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
serving/integrations/index
serving/multimodal_inputs
```

```{toctree}
:caption: Deployment
:maxdepth: 1
deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
```

```{toctree}
@@ -90,14 +97,14 @@ models/generative_models
models/pooling_models
models/adding_model
models/enabling_multimodal_inputs
models/loaders/index
```

```{toctree}
:caption: Usage
:maxdepth: 1
usage/lora
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
8 changes: 8 additions & 0 deletions docs/source/models/loaders/index.md
@@ -0,0 +1,8 @@
# Alternative model loaders

```{toctree}
:maxdepth: 1
runai_model_streamer
tensorizer
```
@@ -1,6 +1,6 @@
(runai-model-streamer)=

# Loading Models with Run:ai Model Streamer
# Run:ai Model Streamer

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
@@ -1,6 +1,6 @@
(tensorizer)=

# Loading Models with CoreWeave's Tensorizer
# Tensorizer

vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
17 changes: 0 additions & 17 deletions docs/source/serving/integrations.md

This file was deleted.

8 changes: 8 additions & 0 deletions docs/source/serving/integrations/index.md
@@ -0,0 +1,8 @@
# External integrations

```{toctree}
:maxdepth: 1
langchain
llamaindex
```
@@ -1,10 +1,10 @@
(run-on-langchain)=
(serving-langchain)=

# Serving with Langchain
# LangChain

vLLM is also available via [Langchain](https://github.com/langchain-ai/langchain) .
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .

To install langchain, run
To install LangChain, run

```console
$ pip install langchain langchain_community -q
@@ -1,10 +1,10 @@
(run-on-llamaindex)=
(serving-llamaindex)=

# Serving with llama_index
# LlamaIndex

vLLM is also available via [llama_index](https://github.com/run-llama/llama_index) .
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .

To install llamaindex, run
To install LlamaIndex, run

```console
$ pip install llama-index-llms-vllm -q
2 changes: 1 addition & 1 deletion docs/source/serving/metrics.md
@@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server.
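
Once a server is running (see below), the `/metrics` endpoint can be polled with any HTTP client; a minimal sketch using `requests`, assuming the default address:

```python
import requests

# The vLLM OpenAI-compatible server exposes Prometheus-style metrics at /metrics.
response = requests.get("http://localhost:8000/metrics")
response.raise_for_status()

# Print the first few metric lines as a quick sanity check.
for line in response.text.splitlines()[:10]:
    print(line)
```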

You can start the server using Python, or using [Docker](deploying_with_docker.md):
You can start the server using Python, or using [Docker](#deployment-docker):

```console
$ vllm serve unsloth/Llama-3.2-1B-Instruct
File renamed without changes.