[Doc][3/N] Reorganize Serving section #11766

Merged: 12 commits, Jan 7, 2025
README.md (1 addition, 1 deletion)
@@ -77,7 +77,7 @@ pip install vllm
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
- [List of Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

## Contributing

docs/source/contributing/dockerfile/dockerfile.md (1 addition, 1 deletion)
@@ -1,7 +1,7 @@
# Dockerfile

We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
More information about deploying with Docker can be found [here](#deployment-docker).

Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

docs/source/contributing/model/registration.md (2 additions, 2 deletions)
@@ -3,7 +3,7 @@
# Model Registration

vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found on the [Supported Models](#supported-models) page.
A list of pre-registered architectures can be found [here](#supported-models).

If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so.
@@ -16,7 +16,7 @@ This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](#new-model-basic)), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
You should also include an example HuggingFace repository for this model in <gh-file:tests/models/registry.py> to run the unit tests.
Finally, update the [Supported Models](#supported-models) documentation page to promote your model!
Finally, update our [list of supported models](#supported-models) to promote your model!
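
For illustration only, a registry entry maps an architecture name (as reported in the model's HuggingFace config) to the module and class that implement it; the exact layout of `_VLLM_MODELS` can change between versions, so check <gh-file:vllm/model_executor/models/registry.py> before editing it. The names below (`MyModelForCausalLM`, `my_model`) are hypothetical:

```python
# Sketch of a registry entry; the real dictionary in
# vllm/model_executor/models/registry.py may be structured differently.
_VLLM_MODELS = {
    # "ArchitectureName": (module under vllm/model_executor/models/, class name)
    "MyModelForCausalLM": ("my_model", "MyModelForCausalLM"),
}
```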

```{important}
The list of models in each section should be maintained in alphabetical order.
@@ -1,6 +1,6 @@
(deploying-with-docker)=
(deployment-docker)=

# Deploying with Docker
# Using Docker

## Use vLLM's Official Docker Image

@@ -1,6 +1,6 @@
(deploying-with-bentoml)=
(deployment-bentoml)=

# Deploying with BentoML
# BentoML

[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

@@ -1,6 +1,6 @@
(deploying-with-cerebrium)=
(deployment-cerebrium)=

# Deploying with Cerebrium
# Cerebrium

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-dstack)=
(deployment-dstack)=

# Deploying with dstack
# dstack

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-helm)=
(deployment-helm)=

# Deploying with Helm
# Helm

A Helm chart to deploy vLLM for Kubernetes

@@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release.

## Architecture

```{image} architecture_helm_deployment.png
```{image} /assets/deployment/architecture_helm_deployment.png
```

## Values
docs/source/deployment/frameworks/index.md (13 additions, 0 deletions)
@@ -0,0 +1,13 @@
# Using other frameworks

```{toctree}
:maxdepth: 1

bentoml
cerebrium
dstack
helm
lws
skypilot
triton
```
@@ -1,6 +1,6 @@
(deploying-with-lws)=
(deployment-lws)=

# Deploying with LWS
# LWS

LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
@@ -1,6 +1,6 @@
(on-cloud)=
(deployment-skypilot)=

# Deploying and scaling up with SkyPilot
# SkyPilot

```{raw} html
<p align="center">
@@ -12,9 +12,9 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet

## Prerequisites

- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model {code}`meta-llama/Meta-Llama-3-8B-Instruct`.
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that {code}`sky check` shows clouds or Kubernetes are enabled.
- Check that `sky check` shows clouds or Kubernetes are enabled.

```console
pip install skypilot-nightly
@@ -1,5 +1,5 @@
(deploying-with-triton)=
(deployment-triton)=

# Deploying with NVIDIA Triton
# NVIDIA Triton

The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
docs/source/deployment/integrations/index.md (9 additions, 0 deletions)
@@ -0,0 +1,9 @@
# External Integrations

```{toctree}
:maxdepth: 1

kserve
kubeai
llamastack
```
@@ -1,6 +1,6 @@
(deploying-with-kserve)=
(deployment-kserve)=

# Deploying with KServe
# KServe

vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

@@ -1,6 +1,6 @@
(deploying-with-kubeai)=
(deployment-kubeai)=

# Deploying with KubeAI
# KubeAI

[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

@@ -1,6 +1,6 @@
(run-on-llamastack)=
(deployment-llamastack)=

# Serving with Llama Stack
# Llama Stack

vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

@@ -1,6 +1,6 @@
(deploying-with-k8s)=
(deployment-k8s)=

# Deploying with Kubernetes
# Using Kubernetes

Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

@@ -1,6 +1,6 @@
(nginxloadbalancer)=

# Deploying with Nginx Loadbalancer
# Using Nginx

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

docs/source/design/arch_overview.md (1 addition, 1 deletion)
@@ -57,7 +57,7 @@ More API details can be found in the {doc}`Offline Inference

The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.

### OpenAI-compatible API server
### OpenAI-Compatible API Server

The second primary interface to vLLM is via its OpenAI-compatible API server.
This server can be started using the `vllm serve` command.
docs/source/features/disagg_prefill.md (6 additions, 2 deletions)
@@ -1,8 +1,12 @@
(disagg-prefill)=

# Disaggregated prefilling (experimental)
# Disaggregated Prefilling (experimental)

This page introduces you the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change.
This page introduces you to the disaggregated prefilling feature in vLLM.

```{note}
This feature is experimental and subject to change.
```

## Why disaggregated prefilling?

docs/source/features/spec_decode.md (1 addition, 1 deletion)
@@ -1,6 +1,6 @@
(spec-decode)=

# Speculative decoding
# Speculative Decoding

```{warning}
Please note that speculative decoding in vLLM is not yet optimized and does
docs/source/getting_started/installation/gpu-rocm.md (1 addition, 1 deletion)
@@ -148,7 +148,7 @@ $ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
$ python3 setup.py develop
```

This may take 5-10 minutes. Currently, {code}`pip install .` does not work for ROCm installation.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

```{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
docs/source/getting_started/installation/hpu-gaudi.md (1 addition, 1 deletion)
@@ -82,7 +82,7 @@ $ python setup.py develop

## Supported Features

- [Offline batched inference](#offline-batched-inference)
- [Offline inference](#offline-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
docs/source/getting_started/quickstart.md (10 additions, 8 deletions)
@@ -2,30 +2,32 @@

# Quickstart

This guide will help you quickly get started with vLLM to:
This guide will help you quickly get started with vLLM to perform:

- [Run offline batched inference](#offline-batched-inference)
- [Run OpenAI-compatible inference](#openai-compatible-server)
- [Offline batched inference](#quickstart-offline)
- [Online inference using OpenAI-compatible server](#quickstart-online)

## Prerequisites

- OS: Linux
- Python: 3.9 -- 3.12
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

## Installation

You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.

```console
$ conda create -n myenv python=3.10 -y
$ conda activate myenv
$ pip install vllm
```

Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
```{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
```

(offline-batched-inference)=
(quickstart-offline)=

## Offline Batched Inference

@@ -73,7 +75,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
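
For reference (the diff only shows the tail of the example), the full pattern in this section looks roughly like the following sketch; the model name `facebook/opt-125m` is just a small illustrative choice and any supported model works:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Sampling settings are illustrative; adjust them for your use case.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Any model from the supported models list can be used here.
llm = LLM(model="facebook/opt-125m")

# generate() runs the whole batch of prompts through the engine.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```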

(openai-compatible-server)=
(quickstart-online)=

## OpenAI-Compatible Server
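
The body of this section is collapsed in the diff, but the general workflow is: start the server with `vllm serve <model>`, then query it with any OpenAI client library. A minimal sketch, assuming the server runs on the default port 8000 and was started with an illustrative model:

```python
# Assumes a server was started in another terminal, for example:
#   vllm serve facebook/opt-125m
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default vLLM server address
    api_key="EMPTY",  # no real key is required unless the server sets one
)

completion = client.completions.create(
    model="facebook/opt-125m",  # must match the model the server was started with
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```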

docs/source/index.md (28 additions, 21 deletions)
@@ -65,32 +65,14 @@ getting_started/troubleshooting
getting_started/faq
```

```{toctree}
:caption: Serving
:maxdepth: 1

serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
serving/engine_args
serving/env_vars
serving/usage_stats
```

```{toctree}
:caption: Models
:maxdepth: 1

models/supported_models
models/generative_models
models/pooling_models
models/supported_models
models/extensions/index
```

```{toctree}
@@ -99,7 +81,6 @@

features/quantization/index
features/lora
features/multimodal_inputs
features/tool_calling
features/structured_outputs
features/automatic_prefix_caching
@@ -108,6 +89,32 @@ features/spec_decode
features/compatibility_matrix
```

```{toctree}
:caption: Inference and Serving
:maxdepth: 1

serving/offline_inference
serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
```

```{toctree}
:caption: Deployment
:maxdepth: 1

deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
```

```{toctree}
:caption: Performance
:maxdepth: 1
docs/source/models/extensions/index.md (8 additions, 0 deletions)
@@ -0,0 +1,8 @@
# Built-in Extensions

```{toctree}
:maxdepth: 1

runai_model_streamer
tensorizer
```
@@ -1,6 +1,6 @@
(runai-model-streamer)=

# Loading Models with Run:ai Model Streamer
# Loading models with Run:ai Model Streamer

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
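
The usage details are collapsed in this diff; as a rough sketch, the streamer is selected through the engine's load format. The `load_format` value below is an assumption, so verify it against the full page before relying on it:

```python
from vllm import LLM

# Assumption: "runai_streamer" is the load_format value that enables the
# Run:ai Model Streamer; check the full documentation page for the exact name.
llm = LLM(
    model="facebook/opt-125m",  # illustrative model
    load_format="runai_streamer",
)
```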
@@ -1,6 +1,6 @@
(tensorizer)=

# Loading Models with CoreWeave's Tensorizer
# Loading models with CoreWeave's Tensorizer

vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized