[Doc] Show default pooling method in a table #11904

Merged 2 commits on Jan 10, 2025
8 changes: 4 additions & 4 deletions docs/source/models/generative_models.md
@@ -8,14 +8,14 @@ In vLLM, generative models implement the {class}`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.

+ For generative models, the only supported `--task` option is `"generate"`.
+ Usually, this is automatically inferred so you don't have to specify it.
+
## Offline Inference

The {class}`~vllm.LLM` class provides various methods for offline inference.
See [Engine Arguments](#engine-args) for a list of options when initializing the model.

- For generative models, the only supported {code}`task` option is {code}`"generate"`.
- Usually, this is automatically inferred so you don't have to specify it.
-
### `LLM.generate`

The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
@@ -33,7 +33,7 @@ for output in outputs:
```

You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
- For example, you can use greedy sampling by setting {code}`temperature=0`:
+ For example, you can use greedy sampling by setting `temperature=0`:

```python
llm = LLM(model="facebook/opt-125m")
```
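
To make the greedy-sampling pattern above concrete, here is a minimal, self-contained sketch (the model name and prompt are only examples, not part of the diff):

```python
from vllm import LLM, SamplingParams

# Any generative checkpoint works; facebook/opt-125m is used for its small size.
llm = LLM(model="facebook/opt-125m")

# temperature=0 disables random sampling, so decoding becomes greedy.
params = SamplingParams(temperature=0)

outputs = llm.generate("Hello, my name is", params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```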
59 changes: 41 additions & 18 deletions docs/source/models/pooling_models.md
@@ -14,30 +14,53 @@ As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
```

- ## Offline Inference
-
- The {class}`~vllm.LLM` class provides various methods for offline inference.
- See [Engine Arguments](#engine-args) for a list of options when initializing the model.

- For pooling models, we support the following {code}`task` options:
-
- - Embedding ({code}`"embed"` / {code}`"embedding"`)
- - Classification ({code}`"classify"`)
- - Sentence Pair Scoring ({code}`"score"`)
- - Reward Modeling ({code}`"reward"`)
+ For pooling models, we support the following `--task` options.
+ The selected option sets the default pooler used to extract the final hidden states:

+ ```{list-table}
+ :widths: 50 25 25 25
+ :header-rows: 1
+
+ * - Task
+   - Pooling Type
+   - Normalization
+   - Softmax
+ * - Embedding (`embed`)
+   - `LAST`
+   - ✅︎
+   - ✗
+ * - Classification (`classify`)
+   - `LAST`
+   - ✗
+   - ✅︎
+ * - Sentence Pair Scoring (`score`)
+   - \*
+   - \*
+   - \*
+ * - Reward Modeling (`reward`)
+   - `ALL`
+   - ✗
+   - ✗
+ ```

- The selected task determines the default {class}`~vllm.model_executor.layers.Pooler` that is used:
+ \*The default pooler is always defined by the model.

- - Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- - Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- - Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- - Reward Modeling: Extract all of the hidden states and return them directly.
+ ```{note}
+ If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
+ ```

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
- we attempt to override the default pooler based on its Sentence Transformers configuration file ({code}`modules.json`).
+ we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).

- You can customize the model's pooling method via the {code}`override_pooler_config` option,
+ ```{tip}
+ You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
+ ```

+ ## Offline Inference
+
+ The {class}`~vllm.LLM` class provides various methods for offline inference.
+ See [Engine Arguments](#engine-args) for a list of options when initializing the model.

### `LLM.encode`
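
The body of this section is collapsed in the diff view; as a hedged illustration only, a minimal `LLM.encode` call might look like the following (the model name is an example, and output attribute names can differ across vLLM versions):

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

# encode() runs the pooler instead of the sampler and returns pooled outputs.
(output,) = llm.encode("Hello, my name is")
print(f"Pooled output: {output.outputs.data!r}")
```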
