diff --git a/README.md b/README.md
index 253a0bb913e37..993fd6801fa35 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
 vLLM is fast with:
 
 - State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai)
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
 - Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
diff --git a/docs/source/design/automatic_prefix_caching.md b/docs/source/design/automatic_prefix_caching.md
index 4398536b2b4ad..69498fe6c6be5 100644
--- a/docs/source/design/automatic_prefix_caching.md
+++ b/docs/source/design/automatic_prefix_caching.md
@@ -2,7 +2,7 @@
 
 # Automatic Prefix Caching
 
-The core idea of [PagedAttention](#design-paged-attention) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
+The core idea of [PagedAttention](https://vllm.ai) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
 
 To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
 
diff --git a/docs/source/index.md b/docs/source/index.md
index 23e4304fe29d9..ad94994c53688 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
 vLLM is fast with:
 
 - State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai)
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
@@ -54,6 +54,8 @@ For more information, check out the following:
 
 ## Documentation
 
+% For newcomers
+
 ```{toctree}
 :caption: Getting Started
 :maxdepth: 1
@@ -65,6 +67,8 @@ getting_started/troubleshooting
 getting_started/faq
 ```
 
+% What does vLLM support?
+
 ```{toctree}
 :caption: Models
 :maxdepth: 1
@@ -75,6 +79,8 @@ models/supported_models
 models/extensions/index
 ```
 
+% Additional capabilities
+
 ```{toctree}
 :caption: Features
 :maxdepth: 1
@@ -89,6 +95,8 @@ features/spec_decode
 features/compatibility_matrix
 ```
 
+% Running vLLM
+
 ```{toctree}
 :caption: Inference and Serving
 :maxdepth: 1
@@ -104,6 +112,8 @@ serving/usage_stats
 serving/integrations/index
 ```
 
+% Scaling up vLLM for production
+
 ```{toctree}
 :caption: Deployment
 :maxdepth: 1
@@ -115,6 +125,8 @@ deployment/frameworks/index
 deployment/integrations/index
 ```
 
+% Making the most out of vLLM
+
 ```{toctree}
 :caption: Performance
 :maxdepth: 1
@@ -123,28 +135,7 @@ performance/optimization
 performance/benchmarks
 ```
 
-% Community: User community resources
-
-```{toctree}
-:caption: Community
-:maxdepth: 1
-
-community/meetups
-community/sponsors
-```
-
-```{toctree}
-:caption: API Reference
-:maxdepth: 2
-
-api/offline_inference/index
-api/engine/index
-api/inference_params
-api/multimodal/index
-api/model/index
-```
-
-% Design Documents: Details about vLLM internals
+% Details about vLLM internals
 
 ```{toctree}
 :caption: Design Documents
@@ -159,7 +150,7 @@ design/automatic_prefix_caching
 design/multiprocessing
 ```
 
-% Developer Guide: How to contribute to the vLLM project
+% How to contribute to the vLLM project
 
 ```{toctree}
 :caption: Developer Guide
@@ -172,6 +163,29 @@ contributing/model/index
 contributing/vulnerability_management
 ```
 
+% Technical API specifications
+
+```{toctree}
+:caption: API Reference
+:maxdepth: 2
+
+api/offline_inference/index
+api/engine/index
+api/inference_params
+api/multimodal/index
+api/model/index
+```
+
+% Latest news and acknowledgements
+
+```{toctree}
+:caption: Community
+:maxdepth: 1
+
+community/meetups
+community/sponsors
+```
+
 # Indices and tables
 
 - {ref}`genindex`
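
The `automatic_prefix_caching.md` paragraph touched above rests on one observation: a KV block is fully determined by its own tokens plus every token that precedes it, so a hash of that combination can serve as the block's cache key. Below is a minimal, self-contained sketch of that idea; the function names, the SHA-256 hash, and the block size of 16 are illustrative assumptions, not vLLM's actual implementation.

```python
import hashlib


def block_hash(prefix_tokens: list[int], block_tokens: list[int]) -> str:
    """Hash a KV block by its own tokens plus all tokens in the prefix before it."""
    payload = ",".join(map(str, prefix_tokens + block_tokens))
    return hashlib.sha256(payload.encode()).hexdigest()


def block_hashes(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Split a token sequence into fixed-size blocks and hash each full block."""
    hashes = []
    for i in range(len(token_ids) // block_size):
        start = i * block_size
        prefix = token_ids[:start]
        block = token_ids[start:start + block_size]
        hashes.append(block_hash(prefix, block))
    return hashes


# Two prompts that share their first 16 tokens yield the same hash for the
# first block, so its cached KV entries could be reused instead of recomputed.
shared_prefix = list(range(16))
assert block_hashes(shared_prefix + [7] * 16)[0] == block_hashes(shared_prefix + [9] * 16)[0]
```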