Reorganize Serving section
Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 committed Jan 6, 2025
1 parent 996357e commit 7923116
Showing 31 changed files with 191 additions and 80 deletions.
2 changes: 1 addition & 1 deletion docs/source/contributing/dockerfile/dockerfile.md
@@ -1,7 +1,7 @@
# Dockerfile

We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
More information about deploying with Docker can be found [here](#deployment-docker).

Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

@@ -1,6 +1,6 @@
(deploying-with-docker)=
(deployment-docker)=

# Deploying with Docker
# Using Docker

## Use vLLM's Official Docker Image

@@ -1,6 +1,6 @@
(deploying-with-bentoml)=
(deployment-bentoml)=

# Deploying with BentoML
# BentoML

[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

@@ -1,6 +1,6 @@
(deploying-with-cerebrium)=
(deployment-cerebrium)=

# Deploying with Cerebrium
# Cerebrium

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-dstack)=
(deployment-dstack)=

# Deploying with dstack
# dstack

```{raw} html
<p align="center">
@@ -1,6 +1,6 @@
(deploying-with-helm)=
(deployment-helm)=

# Deploying with Helm
# Helm

A Helm chart to deploy vLLM for Kubernetes

@@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release.

## Architecture

```{image} architecture_helm_deployment.png
```{image} /assets/deployment/architecture_helm_deployment.png
```

## Values
13 changes: 13 additions & 0 deletions docs/source/deployment/frameworks/index.md
@@ -0,0 +1,13 @@
# Using other frameworks

```{toctree}
:maxdepth: 1
bentoml
cerebrium
dstack
helm
lws
skypilot
triton
```
@@ -1,6 +1,6 @@
(deploying-with-lws)=
(deployment-lws)=

# Deploying with LWS
# LWS

LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
@@ -1,6 +1,6 @@
(on-cloud)=
(deployment-skypilot)=

# Deploying and scaling up with SkyPilot
# SkyPilot

```{raw} html
<p align="center">
@@ -1,5 +1,5 @@
(deploying-with-triton)=
(deployment-triton)=

# Deploying with NVIDIA Triton
# NVIDIA Triton

The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
9 changes: 9 additions & 0 deletions docs/source/deployment/integrations/index.md
@@ -0,0 +1,9 @@
# External integrations

```{toctree}
:maxdepth: 1
kserve
kubeai
llamastack
```
@@ -1,6 +1,6 @@
(deploying-with-kserve)=
(deployment-kserve)=

# Deploying with KServe
# KServe

vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

@@ -1,6 +1,6 @@
(deploying-with-kubeai)=
(deployment-kubeai)=

# Deploying with KubeAI
# KubeAI

[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

@@ -1,6 +1,6 @@
(run-on-llamastack)=
(deployment-llamastack)=

# Serving with Llama Stack
# Llama Stack

vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

@@ -1,6 +1,6 @@
(deploying-with-k8s)=
(deployment-k8s)=

# Deploying with Kubernetes
# Using Kubernetes

Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

@@ -43,7 +43,7 @@ metadata:
name: hf-token-secret
namespace: default
type: Opaque
stringData:
data:
token: "REPLACE_WITH_TOKEN"
```
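
Note that, unlike `stringData`, the `data` field of a Kubernetes Secret expects base64-encoded values, so the token placeholder above needs to be replaced with an encoded string. A quick sketch of producing that value in Python:

```python
import base64

# `data` values in a Kubernetes Secret must be base64-encoded
# (whereas `stringData` accepts plain strings).
hf_token = "REPLACE_WITH_TOKEN"  # placeholder for your actual Hugging Face token
print(base64.b64encode(hf_token.encode("utf-8")).decode("ascii"))
```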
@@ -1,6 +1,6 @@
(nginxloadbalancer)=

# Deploying with Nginx Loadbalancer
# Using Nginx

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

2 changes: 1 addition & 1 deletion docs/source/getting_started/installation/hpu-gaudi.md
@@ -82,7 +82,7 @@ $ python setup.py develop

## Supported Features

- [Offline batched inference](#offline-batched-inference)
- [Offline inference](#offline-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
22 changes: 12 additions & 10 deletions docs/source/getting_started/quickstart.md
@@ -2,32 +2,34 @@

# Quickstart

This guide will help you quickly get started with vLLM to:
This guide will help you quickly get started with vLLM to perform:

- [Run offline batched inference](#offline-batched-inference)
- [Run OpenAI-compatible inference](#openai-compatible-server)
- [Offline batched inference](#quickstart-offline)
- [Online inference using OpenAI-compatible server](#quickstart-online)

## Prerequisites

- OS: Linux
- Python: 3.9 -- 3.12
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

## Installation

You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.

```console
$ conda create -n myenv python=3.10 -y
$ conda activate myenv
$ pip install vllm
```

Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
```{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
```

(offline-batched-inference)=
(quickstart-offline)=

## Offline Batched Inference
## Offline batched inference

With vLLM installed, you can start generating texts for a list of input prompts (i.e., offline batch inference). See the example script: <gh-file:examples/offline_inference.py>

@@ -73,9 +75,9 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
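
For reference, a minimal end-to-end sketch of the offline batched inference API shown above (the prompts and model name are only examples):

```python
from vllm import LLM, SamplingParams

# A small batch of prompts processed in a single offline run.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Any Hugging Face model ID works here; facebook/opt-125m is just a small example.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```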

(openai-compatible-server)=
(quickstart-online)=

## OpenAI-Compatible Server
## OpenAI-compatible server

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
By default, it starts the server at `http://localhost:8000`. You can specify the address with `--host` and `--port` arguments. The server currently hosts one model at a time and implements endpoints such as [list models](https://platform.openai.com/docs/api-reference/models/list), [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create), and [create completion](https://platform.openai.com/docs/api-reference/completions/create) endpoints.
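
As a rough sketch of how a client might exercise these endpoints with the official `openai` Python package (the base URL assumes the default host and port, and the API key is a dummy value unless the server is configured to require one):

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the model(s) currently hosted by the server.
models = client.models.list()
model_name = models.data[0].id

# Create a chat completion against the hosted model.
chat = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)
```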
25 changes: 16 additions & 9 deletions docs/source/index.md
@@ -66,19 +66,26 @@ getting_started/faq
```

```{toctree}
:caption: Serving
:caption: Inference and Serving
:maxdepth: 1
serving/offline_inference
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
serving/integrations/index
serving/multimodal_inputs
```

```{toctree}
:caption: Deployment
:maxdepth: 1
deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
```

```{toctree}
@@ -90,14 +97,14 @@ models/generative_models
models/pooling_models
models/adding_model
models/enabling_multimodal_inputs
models/loaders/index
```

```{toctree}
:caption: Usage
:maxdepth: 1
usage/lora
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
8 changes: 8 additions & 0 deletions docs/source/models/loaders/index.md
@@ -0,0 +1,8 @@
# Alternative model loaders

```{toctree}
:maxdepth: 1
runai_model_streamer
tensorizer
```
@@ -1,6 +1,6 @@
(runai-model-streamer)=

# Loading Models with Run:ai Model Streamer
# Run:ai Model Streamer

Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
@@ -1,6 +1,6 @@
(tensorizer)=

# Loading Models with CoreWeave's Tensorizer
# Tensorizer

vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
17 changes: 0 additions & 17 deletions docs/source/serving/integrations.md

This file was deleted.

8 changes: 8 additions & 0 deletions docs/source/serving/integrations/index.md
@@ -0,0 +1,8 @@
# External integrations

```{toctree}
:maxdepth: 1
langchain
llamaindex
```
@@ -1,10 +1,10 @@
(run-on-langchain)=
(serving-langchain)=

# Serving with Langchain
# LangChain

vLLM is also available via [Langchain](https://github.com/langchain-ai/langchain) .
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .

To install langchain, run
To install LangChain, run

```console
$ pip install langchain langchain_community -q
@@ -1,10 +1,10 @@
(run-on-llamaindex)=
(serving-llamaindex)=

# Serving with llama_index
# LlamaIndex

vLLM is also available via [llama_index](https://github.com/run-llama/llama_index) .
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .

To install llamaindex, run
To install LlamaIndex, run

```console
$ pip install llama-index-llms-vllm -q
2 changes: 1 addition & 1 deletion docs/source/serving/metrics.md
@@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server.
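
Once a server is running (see below), the `/metrics` endpoint can be polled with any HTTP client; a minimal sketch using `requests`, assuming the default address:

```python
import requests

# The vLLM OpenAI-compatible server exposes Prometheus-style metrics at /metrics.
response = requests.get("http://localhost:8000/metrics")
response.raise_for_status()

# Print the first few metric lines as a quick sanity check.
for line in response.text.splitlines()[:10]:
    print(line)
```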

You can start the server using Python, or using [Docker](deploying_with_docker.md):
You can start the server using Python, or using [Docker](#deployment-docker):

```console
$ vllm serve unsloth/Llama-3.2-1B-Instruct
File renamed without changes.