[Doc][5/N] Move Community and API Reference to the bottom #11896

Merged 2 commits on Jan 10, 2025
README.md (2 changes: 1 addition & 1 deletion)
@@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
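For context, the README this hunk touches describes vLLM as a library for LLM inference and serving; below is a minimal offline-inference sketch using vLLM's public `LLM`/`SamplingParams` API. The model name and sampling values are illustrative placeholders and are not part of this change.

```python
# Minimal vLLM offline-inference sketch; model and sampling values are placeholders.
from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are handled internally by the engine.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for request_output in outputs:
    print(request_output.outputs[0].text)
```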
docs/source/design/automatic_prefix_caching.md (2 changes: 1 addition & 1 deletion)
@@ -2,7 +2,7 @@

# Automatic Prefix Caching

The core idea of [PagedAttention](#design-paged-attention) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.

To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
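To make that observation concrete, here is a short Python sketch of the idea: each block's identity is derived from the prefix tokens before it plus the tokens it contains, so requests that share a prefix map to the same leading block IDs. The block size and hashing scheme below are assumptions for illustration, not vLLM's actual data structures.

```python
# Illustrative sketch of prefix-based KV block identification; the block size
# and hashing scheme are assumptions, not vLLM's real implementation.
from hashlib import sha256

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

def kv_block_ids(token_ids: list[int]) -> list[str]:
    """Return one ID per full KV block, derived from prefix + block tokens."""
    ids = []
    prefix: tuple[int, ...] = ()
    for start in range(0, len(token_ids), BLOCK_SIZE):
        block = tuple(token_ids[start:start + BLOCK_SIZE])
        if len(block) < BLOCK_SIZE:
            break  # a partial tail block is not cached
        # The hash covers everything up to and including this block, so two
        # requests sharing a prefix produce identical IDs for the shared blocks.
        ids.append(sha256(repr((prefix, block)).encode()).hexdigest()[:16])
        prefix += block
    return ids

# Two prompts with a common first block reuse the same ID for that block.
a = kv_block_ids(list(range(32)))
b = kv_block_ids(list(range(16)) + list(range(100, 116)))
assert a[0] == b[0] and a[1] != b[1]
```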

docs/source/index.md (62 changes: 38 additions & 24 deletions)
@@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
@@ -54,6 +54,8 @@ For more information, check out the following:

## Documentation

% How to start using vLLM?

```{toctree}
:caption: Getting Started
:maxdepth: 1
@@ -65,6 +67,8 @@ getting_started/troubleshooting
getting_started/faq
```

% What does vLLM support?

```{toctree}
:caption: Models
:maxdepth: 1
@@ -75,6 +79,8 @@ models/supported_models
models/extensions/index
```

% Additional capabilities

```{toctree}
:caption: Features
:maxdepth: 1
@@ -89,6 +95,8 @@ features/spec_decode
features/compatibility_matrix
```

% Details about running vLLM

```{toctree}
:caption: Inference and Serving
:maxdepth: 1
@@ -104,6 +112,8 @@ serving/usage_stats
serving/integrations/index
```

% Scaling up vLLM for production

```{toctree}
:caption: Deployment
:maxdepth: 1
@@ -115,6 +125,8 @@ deployment/frameworks/index
deployment/integrations/index
```

% Making the most out of vLLM

```{toctree}
:caption: Performance
:maxdepth: 1
@@ -123,28 +135,7 @@ performance/optimization
performance/benchmarks
```

% Community: User community resources

```{toctree}
:caption: Community
:maxdepth: 1

community/meetups
community/sponsors
```

```{toctree}
:caption: API Reference
:maxdepth: 2

api/offline_inference/index
api/engine/index
api/inference_params
api/multimodal/index
api/model/index
```

% Design Documents: Details about vLLM internals
% Explanation of vLLM internals

```{toctree}
:caption: Design Documents
@@ -159,7 +150,7 @@ design/automatic_prefix_caching
design/multiprocessing
```

% Developer Guide: How to contribute to the vLLM project
% How to contribute to the vLLM project

```{toctree}
:caption: Developer Guide
@@ -172,6 +163,29 @@ contributing/model/index
contributing/vulnerability_management
```

% Technical API specifications

```{toctree}
:caption: API Reference
:maxdepth: 2

api/offline_inference/index
api/engine/index
api/inference_params
api/multimodal/index
api/model/index
```

% Latest news and acknowledgements

```{toctree}
:caption: Community
:maxdepth: 1

community/meetups
community/sponsors
```

# Indices and tables

- {ref}`genindex`