[Doc] Show default pooling method in a table (#11904)
Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 authored Jan 10, 2025
1 parent b844b99 commit 3de2b1e
Showing 2 changed files with 45 additions and 22 deletions.
8 changes: 4 additions & 4 deletions docs/source/models/generative_models.md
@@ -8,14 +8,14 @@ In vLLM, generative models implement the {class}`~vllm.model_executor.models.Vll
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
 which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.
 
+For generative models, the only supported `--task` option is `"generate"`.
+Usually, this is automatically inferred so you don't have to specify it.
+
 ## Offline Inference
 
 The {class}`~vllm.LLM` class provides various methods for offline inference.
 See [Engine Arguments](#engine-args) for a list of options when initializing the model.
 
-For generative models, the only supported {code}`task` option is {code}`"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
-
 ### `LLM.generate`
 
 The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
@@ -33,7 +33,7 @@ for output in outputs:
 ```
 
 You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
-For example, you can use greedy sampling by setting {code}`temperature=0`:
+For example, you can use greedy sampling by setting `temperature=0`:
 
 ```python
 llm = LLM(model="facebook/opt-125m")
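As an illustrative aside (not part of the diff): the hunk above documents that `temperature=0` selects greedy sampling. The relationship between the temperature parameter and greedy decoding can be sketched in plain Python; the toy logits and the `sample_probs` helper below are hypothetical and are not vLLM APIs:

```python
import math

def sample_probs(logits, temperature):
    # Temperature scales the logits before the softmax; temperature=0 is
    # treated as a special case meaning greedy (argmax) decoding.
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
greedy = sample_probs(logits, temperature=0)    # all mass on the argmax
smooth = sample_probs(logits, temperature=2.0)  # flatter distribution
```

Higher temperatures flatten the distribution; at the limit of `temperature=0` all probability mass collapses onto the single highest-logit token, which is why it behaves as greedy sampling.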
59 changes: 41 additions & 18 deletions docs/source/models/pooling_models.md
@@ -14,30 +14,53 @@ As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM feature
 pooling models as they only work on the generation or decode stage, so performance may not improve as much.
 ```
 
-## Offline Inference
-
-The {class}`~vllm.LLM` class provides various methods for offline inference.
-See [Engine Arguments](#engine-args) for a list of options when initializing the model.
-
-For pooling models, we support the following {code}`task` options:
-
-- Embedding ({code}`"embed"` / {code}`"embedding"`)
-- Classification ({code}`"classify"`)
-- Sentence Pair Scoring ({code}`"score"`)
-- Reward Modeling ({code}`"reward"`)
+For pooling models, we support the following `--task` options.
+The selected option sets the default pooler used to extract the final hidden states:
+
+```{list-table}
+:widths: 50 25 25 25
+:header-rows: 1
+
+* - Task
+  - Pooling Type
+  - Normalization
+  - Softmax
+* - Embedding (`embed`)
+  - `LAST`
+  - ✅︎
+  - ✗
+* - Classification (`classify`)
+  - `LAST`
+  - ✗
+  - ✅︎
+* - Sentence Pair Scoring (`score`)
+  - \*
+  - \*
+  - \*
+* - Reward Modeling (`reward`)
+  - `ALL`
+  - ✗
+  - ✗
+```
 
-The selected task determines the default {class}`~vllm.model_executor.layers.Pooler` that is used:
+\*The default pooler is always defined by the model.
 
-- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
-- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
-- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
-- Reward Modeling: Extract all of the hidden states and return them directly.
+```{note}
+If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
+```
 
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the default pooler based on its Sentence Transformers configuration file ({code}`modules.json`).
+we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
 
-You can customize the model's pooling method via the {code}`override_pooler_config` option,
+```{tip}
+You can customize the model's pooling method via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.
+```
+
+## Offline Inference
+
+The {class}`~vllm.LLM` class provides various methods for offline inference.
+See [Engine Arguments](#engine-args) for a list of options when initializing the model.
 
 ### `LLM.encode`
 
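As an illustrative aside (not part of the diff): the table in the hunk above maps each `--task` to a pooling type plus optional normalization or softmax. A toy stdlib-only sketch of those three behaviours, not vLLM's actual `Pooler` implementation, could look like this:

```python
import math

def pool(hidden_states, pooling_type):
    # hidden_states: one vector per token (list of lists of floats)
    if pooling_type == "LAST":  # used by the "embed" and "classify" tasks
        return hidden_states[-1]
    if pooling_type == "ALL":   # used by the "reward" task: return everything
        return hidden_states
    raise ValueError(f"unknown pooling type: {pooling_type}")

def normalize(vec):
    # L2-normalize the pooled vector, as applied for the "embed" task
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def softmax(vec):
    # Turn the pooled vector into probabilities, as for the "classify" task
    m = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]

states = [[0.1, 0.2], [3.0, 4.0]]       # two tokens, hidden size 2
embedding = normalize(pool(states, "LAST"))
class_probs = softmax(pool(states, "LAST"))
rewards = pool(states, "ALL")           # hidden states for every token
```

This also shows why the "score" row is all `\*`: sentence-pair scoring has no single fixed recipe in this sketch, matching the footnote that its pooler is always defined by the model.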
