diff --git a/benchmark/benchmark_dsv3/README.md b/benchmark/benchmark_dsv3/README.md new file mode 100644 index 00000000000..265a805d5b6 --- /dev/null +++ b/benchmark/benchmark_dsv3/README.md @@ -0,0 +1,153 @@ +## Benchmark for SGLang v0.4.1 - DeepSeek v3 on Different H200 configurations + +We research the capabilites of two configurations of H200 NVIDIA GPUs: +- Single-node 8xH200 (BF16/FP8) +- Multi-node 2x8xH200 (BF16/FP8) + - using Infiniband (400Gbps) with `nccl=2.21.5` + +For the benchmarking, we choose as baseline parameters: + +- `--random-range-ratio 1` +- `--request-rate 1 ` +- `--random-input 1024` +- `--random-output 1024` + +Complete results and logs for benchmarks are in [https://github.com/datacrunch-research/h200-benchmarks](https://github.com/datacrunch-research/h200-benchmarks/commit/700675be3e55a62925f9c1a80f0b68ecf724ec13) + +## DeepSeek V3 on 8xH200 (single-node) + +### BF16 + +| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) | +| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- | +| 1 | 300 | 214,924.09 | 587.15 | 209.48 | 159.64 | 639.99 | +| 2 | 600 | 235,524.70 | 598.77 | 229.30 | 162.99 | 1313.74 | +| 4 | 1200 | 324,438.44 | 766.70 | 316.35 | 237.99 | 2378.26 | +| 8 | 2400 | 686,261.57 | 1191.74 | 516.67 | 255.96 | 2249.03 | + +### FP8 + +| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) | +| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- | +| 1 | 300 | 147,735.43 | 563.41 | 143.71 | 101.78 | 773.15 | +| 2 | 600 | 234,757.13 | 684.33 | 228.78 | 149.46 | 1401.77 | +| 4 | 1200 | 376,040.67 | 865.26 | 366.48 | 287.95 | 2214.76 | +| 8 | 2400 | 692,710.83 | 1358.77 | 675.95 | 515.18 | 2864.31 | + +## DeepSeek V3 on 2x8xH200 (multi-node) + +### BF16 + +| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) | +| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- | +| 1 | 300 | 971,353.97 | 53,189.54 | 843.03 | 638.68 | 275.06 | +| 2 | 600 | 2,010,951.23 | 313,373.93 | 1622.07 | 1192.37 | 256.50 | +| 4 | 1200 | 3,881,082.65 | 774,460.73 | 1645.51 | 1178.42 | 255.45 | +| 8 | 2400 | 6,819,185.61 | 4,072,706.72 | 2239.22 | 1205.60 | 250.08 | + +### FP8 + +| RPS | Num Prompts | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Output token throughput (tok/s) | +| ---- | ----------- | ----------------------- | ---------------- | ---------------- | --------------- | ------------------------------- | +| 1 | 300 | 985,610.62 | 56,824.07 | 862.84 | 662.33 | 271.60 | +| 2 | 600 | 1,975,371.99 | 305,318.37 | 1632.35 | 1219.14 | 288.41 | +| 4 | 1200 | 3,901,390.30 | 767,082.14 | 3023.99 | 2189.83 | 269.19 | +| 8 | 2400 | 7,374,173.14 | 1,680,440.41 | 2974.87 | 2007.02 | 276.74 | + +## Environment + +To guarantee benchmarking results reproducibility we execute all the experiments with the latest available SGLang Docker image. Build benchmarking environment running the following commands: + +```bash +$docker pull lmsysorg/sglang:dev + +$docker run -it -d --shm-size 32g --gpus all --net host \ +--env "HF_TOKEN=$HF_TOKEN" \ +-v :/root/.cache/huggingface \ +--ipc=host --name sglang_dev lmsysorg/sglang:latest bash + +$docker exec -it /bin/bash sglang_dev +``` + +## Notes + +Keep in mind the diferences in the commands for optimization techniques due to memory constrains. + +## Online benchmarks + +## DeepSeek V3 on 8xH200 (single-node) + +### BF16 + +```bash +# launch server +python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-torch-compile --enable-dp-attention --mem-fraction-static 0.8 --disable-cuda-graph + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +``` + +### FP8 + +```bash +# launch server +python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 +--quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl +``` +## Deepseek V3 on 2x8xH200 (multi-node) + +### BF16 + +```bash +# launch server +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph + +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl +``` + +### FP8 + +```bash +# launch server +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph + +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +``` diff --git a/benchmark/benchmark_dsv3/deepseek_v3.sh b/benchmark/benchmark_dsv3/deepseek_v3.sh new file mode 100644 index 00000000000..d2fa25dd95d --- /dev/null +++ b/benchmark/benchmark_dsv3/deepseek_v3.sh @@ -0,0 +1,69 @@ +# Docker single-node command: (FP8 version) +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_singlenodeFP8 \ + -it \ + -rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-dp-attention +' + +# Docker multi-node command: (BF16 version) +# Node0: +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_multinode0 \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 + +' + +# Node1: +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_multinode1 \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 + +' + +# Docker basic client command: +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_bnchmrk_client \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000 +' + +# 8xH200/2x8xH200 FP8/BF16 +# Online +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl diff --git a/benchmark/benchmark_v0.4.1.post4/README.md b/benchmark/benchmark_v0.4.1.post4/README.md new file mode 100644 index 00000000000..881cb7fdebc --- /dev/null +++ b/benchmark/benchmark_v0.4.1.post4/README.md @@ -0,0 +1,129 @@ +## Benchmark for SGLang v0.4.1.post4 - DeepSeek v3 on Different H200 configurations + +We research the capabilites of two configurations of H200 NVIDIA GPUs: +- Single-node 8xH200 (BF16/FP8) + +For the benchmarking, we choose as baseline parameters: + +- `--random-range-ratio 1` +- `--request-rate 1 ` +- `--random-input 1024` +- `--random-output 1024` + +Complete results and logs for benchmarks are in https://github.com/datacrunch-research/h200-benchmarks + +## DeepSeek V3 on 8xH200 (single-node) + +### BF16 + + +### FP8 + + +## Environment + +To guarantee benchmarking results reproducibility we execute all the experiments with the latest available SGLang Docker image. Build benchmarking environment running the following commands: + +```bash +$docker pull lmsysorg/sglang:dev + +$docker run -it -d --shm-size 32g --gpus all --net host \ +--env "HF_TOKEN=$HF_TOKEN" \ +-v :/root/.cache/huggingface \ +--ipc=host --name sglang_dev lmsysorg/sglang:latest bash + +$docker exec -it /bin/bash sglang_dev +``` + +## Notes + +Keep in mind the diferences in the commands for optimization techniques due to memory constrains. + +## Online benchmarks + +## DeepSeek V3 on 8xH200 (single-node) + +### BF16 + +```bash +# launch server +python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-torch-compile --enable-dp-attention --mem-fraction-static 0.8 --disable-cuda-graph + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_BF16_online_output.jsonl + +``` + +### FP8 + +```bash +# launch server +python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 +--quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --output-file deepseek_v3_8xh200_FP8_online_output.jsonl +``` +## Deepseek V3 on 2x8xH200 (multi-node) + +### BF16 + +```bash +# launch server +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph + +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_BF16_online_output.jsonl +``` + +### FP8 + +```bash +# launch server +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph + +python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-torch-compile --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-cuda-graph + + +# bench serving +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl + +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +``` + +#### Note: Detach mode +``` +nohup python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 +--quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --enable-dp-attention --host 0.0.0.0 --port 40000 &> singlenode_fp8.log & +``` + +``` +nohup deepseek_v3.sh &> deepseek_v3_fp8_8xh200_log_output.txt +``` \ No newline at end of file diff --git a/benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh b/benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh new file mode 100644 index 00000000000..1b0efd91b01 --- /dev/null +++ b/benchmark/benchmark_v0.4.1.post4/deepseek_v3.sh @@ -0,0 +1,69 @@ +# Docker single-node command: (FP8 version) * PROVISIONAL * +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_singlenodeFP8 \ + -it \ + -rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --quantization fp8 --kv-cache-dtype fp8_e5m2 --trust-remote-code --host 0.0.0.0 --port 40000 --enable-dp-attention +' + +# Docker multi-node command: (BF16 version) * PROVISIONAL * +# Node0: * PROVISIONAL * +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_multinode0 \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 + +' + +# Node1: * PROVISIONAL * +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_multinode1 \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 ----dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000 + +' + +# Docker basic client command: * PROVISIONAL * +: ' +docker run --gpus all \ + --shm-size 32g \ + --network=host \ + -v /mnt/co-research/shared-models:/root/.cache/huggingface \ + --name sglang_bnchmrk_client \ + -it \ + --rm \ + --env "HF_TOKEN=$HF_TOKEN" \ + --ipc=host \ + lmsysorg/sglang:latest \ + python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 1 --host 0.0.0.0 --port 40000 +' + +# 8xH200/2x8xH200 FP8/BF16 +# Online +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl +python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 --host 0.0.0.0 --port 40000 --output-file deepseek_v3_2x8xh200_FP8_online_output.jsonl