
Update doc for server arguments #2742

Merged
merged 23 commits into main from feature/server-arguments-docs on Jan 23, 2025

Conversation

@simveit (Contributor) commented Jan 5, 2025

Motivation

As explained here, the current documentation of the backend needs an update, which we intend to implement in this PR.

Checklist

  • Update documentation as needed, including docstrings or example tutorials.

@simveit simveit force-pushed the feature/server-arguments-docs branch 2 times, most recently from fbc1a63 to abb44cf on January 6, 2025 19:13
@simveit simveit force-pushed the feature/server-arguments-docs branch from abb44cf to 0a288d7 on January 6, 2025 19:18
@zhaochenyang20 (Collaborator) left a comment:

I love the detailed and educational parameter docs. I have two suggestions:

  1. We are documenting the official usage, so we can move the educational part to other unofficial repos, like my ML sys tutorial. 😂
  2. Keep things concise. If we want to explain a concept, I think one sentence of educational explanation plus a link to the details would be better.

## Model and tokenizer
Collaborator:

Cool. But for the docs, always keep one first-order title # and several second-order titles ##; do not use fourth-order titles like ####.

Contributor Author:

Adjusted to include Server Arguments title.

Collaborator:

Perfect


* `tp_size`: This parameter is important if we have multiple GPUs and the model doesn't fit on a single GPU. *Tensor parallelism* means we distribute the model weights over multiple GPUs. Note that this technique is mainly aimed at *memory efficiency* and not at *higher throughput*, as inter-GPU communication is needed to obtain the final output of each layer. For a better understanding of the concept you may look, for example, [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works).

* `stream_interval`: If we stream the output to the user, this parameter determines the interval at which streaming is performed. The interval length is measured in tokens.
Collaborator:

I am not so sure. Could you double-check this and make it clearer?

Contributor Author:

I will look this up more carefully. For now I have left it as a to-do and will come back to it at the end.

Collaborator:

Cool!
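For readers of this thread, a minimal launch sketch combining the two arguments discussed above; `--tp` is confirmed by the examples later in this doc, while the exact `--stream-interval` flag spelling and the sample values are assumptions to verify against `server_args.py`:

```
# Sketch: shard the weights over 2 GPUs and flush streamed output every 4 decoded tokens
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 2 \
  --stream-interval 4
```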


* `random_seed`: Can be used to enforce deterministic behavior.

* `constrained_json_whitespace_pattern`: When using the `Outlines` grammar backend, we can use this to allow JSON with syntactic newlines, tabs, or multiple spaces.
Collaborator:

I think we can create a ## for constraint decoding parameters.

Contributor Author:

In general I think we could restructure the whole section. I suggest doing that after I have included all the parameters.
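As a possible illustration for the two arguments quoted above (a sketch; the dashed flag spellings, the `--grammar-backend outlines` selector, and the example regex are assumptions for illustration, not taken from the doc):

```
# Sketch: deterministic sampling plus Outlines-constrained JSON that may contain newlines/tabs
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --random-seed 42 \
  --grammar-backend outlines \
  --constrained-json-whitespace-pattern '[\n\t ]*'
```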

@simveit simveit force-pushed the feature/server-arguments-docs branch from 58efd67 to b939c56 on January 8, 2025 16:51
@zhaochenyang20 (Collaborator) left a comment:

Perfect! Thanks so much for the help!

* `dist_init_addr`: The TCP address used for initializing PyTorch’s distributed backend (e.g. `192.168.0.2:25000`).
* `nnodes`: Total number of nodes in the cluster.
* `node_rank`: Rank (ID) of this node among the `nnodes` in the distributed setup.
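To make the three multi-node arguments above concrete, here is a two-node sketch (assuming 4 GPUs per node, node 0 reachable at 192.168.0.2, and that `--tp` spans both nodes; adapt the address, model path, and sizes to your cluster):

```
# Node 0 (rank 0)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 8 --dist-init-addr 192.168.0.2:25000 --nnodes 2 --node-rank 0

# Node 1 (rank 1)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 8 --dist-init-addr 192.168.0.2:25000 --nnodes 2 --node-rank 1
```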


## Model override args in JSON
Collaborator:

Better to call this ## Constraint Decoding.

@zhaochenyang20 (Collaborator) left a comment:

Amazing work! We are close to the end!

Collaborator:

Perfect

@zhaochenyang20 (Collaborator) left a comment:

Great, we are close to the end. Are there any parameters left? If not, after fixing these parameters, we can let yineng review.

@zhaochenyang20 (Collaborator) left a comment:

Great! We made it!

@@ -66,7 +66,7 @@ In this document we aim to give an overview of the possible arguments when deplo
* `watchdog_timeout`: Adjusts the watchdog thread’s timeout before killing the server if batch generation takes too long.
* `download_dir`: Use to override the default Hugging Face cache directory for model weights.
* `base_gpu_id`: Use to adjust the first GPU used to distribute the model across the available GPUs.

* `allow_auto_truncate`: Automatically truncate requests that exceed the maximum input length.
Collaborator:

Great!
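For the four arguments in the diff hunk above, a combined launch sketch (the dashed flag forms and the sample values are assumptions for illustration; check them against `server_args.py`):

```
# Sketch: longer watchdog timeout, custom weight cache dir, start on GPU 2, auto-truncate long prompts
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --watchdog-timeout 600 \
  --download-dir /data/hf_cache \
  --base-gpu-id 2 \
  --allow-auto-truncate
```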

@zhaochenyang20 zhaochenyang20 marked this pull request as ready for review January 18, 2025 22:20
@zhaochenyang20 (Collaborator):

@zhyncs Wait for a final go over.

@zhyncs zhyncs requested a review from merrymercy January 20, 2025 18:07
@zhyncs zhyncs requested review from zhyncs and hnyls2002 January 20, 2025 18:07
Comment on lines 3 to 23
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
@Edenzzzz commented Jan 23, 2025:

In addition to explaining individual args, keep some of the popular launch commands/arg combinations here for plug and play?

Collaborator:

> In addition to explaining individual args, keep some of the popular launch commands/arg combinations here for plug and play?

Great suggestion. We should give some examples for the command parameters. @simveit

Contributor Author:

Included the previous doc with a small adjustment to use the Router in the case of DP.
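For context, the Router-based data-parallel launch referenced here might look roughly like this (a sketch; the `sglang_router.launch_server` entry point and the `--dp-size` flag belong to the separate sglang-router package and should be checked against its docs):

```
# Sketch: 2 data-parallel workers behind the SGLang Router instead of plain --dp
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp-size 2
```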

@zhaochenyang20 zhaochenyang20 merged commit 1c4e0d2 into sgl-project:main Jan 23, 2025
1 of 2 checks passed
@simveit simveit deleted the feature/server-arguments-docs branch January 24, 2025 12:04