Update doc for server arguments #2742
Conversation
Force-pushed from fbc1a63 to abb44cf
Force-pushed from abb44cf to 0a288d7
I love the detailed and educational docs of the parameters. I have two suggestions:
- We are documenting the official usage, so we can move the educational part to other unofficial repos, like my ML sys tutorial. 😂
- Keep things concise. If we want to explain a concept, one sentence of educational explanation plus a link to the details would be better.
```
</details>
## Model and tokenizer
Cool. But for the docs, always keep one first-order title `#` and several second-order titles `##`; do not use fourth-order titles `####`.
Adjusted to include a Server Arguments title.
Perfect
docs/backend/server_arguments.md
Outdated
* `tp_size`: This parameter is important if we have multiple GPUs and our model doesn't fit on a single GPU. *Tensor parallelism* means we distribute our model weights over multiple GPUs. Note that this technique is mainly aimed at *memory efficiency* rather than *higher throughput*, as inter-GPU communication is needed to obtain the final output of each layer. For a better understanding of the concept, see for example [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works).
* `stream_interval`: If we stream the output to the user, this parameter determines at which interval streaming is performed. The interval length is measured in tokens.
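For illustration, a minimal sketch of a tensor-parallel launch, assuming the CLI flag simply mirrors the `tp_size` argument name and using a placeholder model path:
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp-size 2
```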
I am not so sure. Could you double-check this and make it clearer?
I will look this up more carefully. For now I have left it as a TODO and will come back to it at the end.
Cool!
docs/backend/server_arguments.md
Outdated
* `random_seed`: Can be used to enforce deterministic behavior.
* `constrained_json_whitespace_pattern`: When using the `Outlines` grammar backend, we can use this to allow JSON with syntactic newlines, tabs or multiple spaces.
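As a hedged illustration (the dashed flags are assumed to mirror the underscore argument names, and both values are placeholders), the two options above might be combined like this:
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --random-seed 42 --constrained-json-whitespace-pattern "[\n\t ]*"
```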
I think we can create a `##` for constraint decoding parameters.
In general I think we could restructure the whole section. I suggest doing that after I have included all parameters.
Force-pushed from 58efd67 to b939c56
…nto feature/server-arguments-docs
Perfect! Thanks so much for the help!
docs/backend/server_arguments.md
Outdated
* `dist_init_addr`: The TCP address used for initializing PyTorch’s distributed backend (e.g. `192.168.0.2:25000`).
* `nnodes`: Total number of nodes in the cluster.
* `node_rank`: Rank (ID) of this node among the `nnodes` in the distributed setup.
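A sketch of how these three options could fit together for a two-node deployment; the address, the `--tp 4` split, and the model path are placeholder assumptions:
```
# node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 4 --dist-init-addr 192.168.0.2:25000 --nnodes 2 --node-rank 0

# node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --tp 4 --dist-init-addr 192.168.0.2:25000 --nnodes 2 --node-rank 1
```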
## Model override args in JSON
Better to call this `## Constraint Decoding`.
…lang into feature/server-arguments-docs
Amazing work! We are close to the end!
```
</details>
## Model and tokenizer
Perfect
Great, we are close to the end. Are there any parameters left? If not, after fixing these parameters, we can let yineng review.
Great! We made it!
@@ -66,7 +66,7 @@ In this document we aim to give an overview of the possible arguments when deplo
* `watchdog_timeout`: Adjusts the watchdog thread’s timeout before killing the server if batch generation takes too long.
* `download_dir`: Use to override the default Hugging Face cache directory for model weights.
* `base_gpu_id`: Use to adjust the first GPU used to distribute the model across available GPUs.
* `allow_auto_truncate`: Automatically truncate requests that exceed the maximum input length.
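A hedged example combining a few of these options in one launch command (the values are placeholders and the dashed flag names are assumed to mirror the underscore argument names):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --watchdog-timeout 600 --download-dir /data/hf-cache --base-gpu-id 1 --allow-auto-truncate
```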
Great!
@zhyncs Waiting for a final go-over.
docs/backend/server_arguments.md
Outdated
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
In addition to explaining individual args, keep some of the popular launch commands/arg combinations here for plug and play?
> In addition to explaining individual args, keep some of the popular launch commands/arg combinations here for plug and play?

Great suggestion. We should give some examples for the command parameters. @simveit
Included the previous doc with a small adjustment to use the Router in the case of DP.
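For reference, a sketch of what a Router-based data-parallel launch might look like; this assumes the `sglang_router` launcher module and its `--dp-size` flag, so treat it as illustrative rather than the exact command used in the doc:
```
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp-size 2
```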
Motivation
As explained here, the current documentation of the backend needs an update, which we intend to implement here.
Checklist