Commit

Move Community and API Reference to the bottom
Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 committed Jan 9, 2025
1 parent 65097ca commit 299f02a
Showing 3 changed files with 40 additions and 26 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:

- State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
2 changes: 1 addition & 1 deletion docs/source/design/automatic_prefix_caching.md
@@ -2,7 +2,7 @@

# Automatic Prefix Caching

-The core idea of [PagedAttention](#design-paged-attention) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
+The core idea of [PagedAttention](https://vllm.ai) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.

To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
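
The observation above can be illustrated with a minimal sketch. This is not vLLM's implementation: the function name `kv_block_hashes`, the `block_size` parameter, and the use of Python's built-in `hash` are all assumptions made for illustration. Each full block is keyed by its own tokens chained with the key of the block before it, so the key implicitly covers the entire prefix.

```python
from typing import List, Optional, Tuple


def kv_block_hashes(token_ids: List[int], block_size: int) -> List[int]:
    """Return one key per full block (hypothetical helper, not a vLLM API).

    Each key depends on the tokens inside the block and, through the
    chained parent key, on every token in the prefix before the block.
    """
    hashes: List[int] = []
    parent: Optional[int] = None  # the first block has no prefix
    # Only full blocks are keyed; a trailing partial block is skipped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block: Tuple[int, ...] = tuple(token_ids[start:start + block_size])
        parent = hash((parent, block))
        hashes.append(parent)
    return hashes


# Two requests that share their first 8 tokens but diverge afterwards.
request_a = kv_block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], block_size=4)
request_b = kv_block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 99, 10, 11, 12], block_size=4)

# The first two block keys match, so those blocks could be shared.
shared_blocks = sum(1 for a, b in zip(request_a, request_b) if a == b)
```

Because the key of a block folds in its parent's key, two blocks with identical tokens but different prefixes get different keys, which is what makes a lookup by key safe for reuse in this sketch.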

62 changes: 38 additions & 24 deletions docs/source/index.md
@@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:

- State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
@@ -54,6 +54,8 @@ For more information, check out the following:

## Documentation

+% How to start using vLLM?
+
```{toctree}
:caption: Getting Started
:maxdepth: 1
@@ -65,6 +67,8 @@ getting_started/troubleshooting
getting_started/faq
```

+% What does vLLM support?
+
```{toctree}
:caption: Models
:maxdepth: 1
@@ -75,6 +79,8 @@ models/supported_models
models/extensions/index
```

+% Additional capabilities
+
```{toctree}
:caption: Features
:maxdepth: 1
@@ -89,6 +95,8 @@ features/spec_decode
features/compatibility_matrix
```

+% Details about running vLLM
+
```{toctree}
:caption: Inference and Serving
:maxdepth: 1
@@ -104,6 +112,8 @@ serving/usage_stats
serving/integrations/index
```

+% Scaling up vLLM for production
+
```{toctree}
:caption: Deployment
:maxdepth: 1
@@ -115,6 +125,8 @@ deployment/frameworks/index
deployment/integrations/index
```

+% Making the most out of vLLM
+
```{toctree}
:caption: Performance
:maxdepth: 1
@@ -123,28 +135,7 @@ performance/optimization
performance/benchmarks
```

-% Community: User community resources
-
-```{toctree}
-:caption: Community
-:maxdepth: 1
-community/meetups
-community/sponsors
-```
-
-```{toctree}
-:caption: API Reference
-:maxdepth: 2
-api/offline_inference/index
-api/engine/index
-api/inference_params
-api/multimodal/index
-api/model/index
-```
-
-% Design Documents: Details about vLLM internals
+% Explanation of vLLM internals

```{toctree}
:caption: Design Documents
@@ -159,7 +150,7 @@ design/automatic_prefix_caching
design/multiprocessing
```

-% Developer Guide: How to contribute to the vLLM project
+% How to contribute to the vLLM project

```{toctree}
:caption: Developer Guide
@@ -172,6 +163,29 @@ contributing/model/index
contributing/vulnerability_management
```

+% Technical API specifications
+
+```{toctree}
+:caption: API Reference
+:maxdepth: 2
+api/offline_inference/index
+api/engine/index
+api/inference_params
+api/multimodal/index
+api/model/index
+```
+
+% Latest news and acknowledgements
+
+```{toctree}
+:caption: Community
+:maxdepth: 1
+community/meetups
+community/sponsors
+```
+
# Indices and tables

- {ref}`genindex`
