Merge branch 'main' into ying-image-chunk
merrymercy authored Dec 9, 2024
2 parents 7deb312 + 3844feb commit 627e9bd
Showing 92 changed files with 4,972 additions and 1,346 deletions.
30 changes: 30 additions & 0 deletions .github/workflows/experiment-runner.yml
@@ -0,0 +1,30 @@
name: Experiment Runner

on:
  workflow_dispatch:
    inputs:
      script:
        description: "Experiment Runner Script"
        default: "configs/sharegpt_config.yaml"

concurrency:
  group: experiment-runner-${{ github.ref }}
  cancel-in-progress: true

jobs:
  experiment-runner-1-gpu:
    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
    runs-on: 1-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install dependencies
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Test experiment runner
        timeout-minutes: 120
        run: |
          cd test/srt
          python3 experiment_runner.py --config ${{ inputs.script }}
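Since the new workflow is exposed only through `workflow_dispatch`, it has to be triggered manually. A minimal sketch of kicking it off with the GitHub CLI (assuming `gh` is installed and authenticated; the `script` input and its default come from the workflow definition above):

```bash
# Manually dispatch the experiment runner with the default ShareGPT config.
gh workflow run experiment-runner.yml \
  --repo sgl-project/sglang \
  -f script=configs/sharegpt_config.yaml
```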
4 changes: 2 additions & 2 deletions .github/workflows/pr-test-rust.yml
@@ -42,7 +42,7 @@ jobs:
   e2e-rust:
     if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
-    runs-on: 1-gpu-runner
+    runs-on: 2-gpu-runner
     steps:
       - name: Checkout code
         uses: actions/checkout@v3
@@ -57,7 +57,7 @@
           cd rust
           pip install setuptools-rust wheel build
           python3 -m build
-          pip install dist/*.whl
+          pip install --force-reinstall dist/*.whl
       - name: Run e2e test
         run: |
           cd rust/py_test
6 changes: 6 additions & 0 deletions .github/workflows/pr-test.yml
@@ -105,6 +105,12 @@ jobs:
           cd test/srt
           python3 test_update_weights_from_distributed.py
+      - name: Evaluate MoE EP accuracy (TP=2)
+        timeout-minutes: 10
+        run: |
+          cd test/srt
+          python3 test_moe_ep.py
+
   performance-test-1-gpu-part-1:
     if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
     runs-on: 1-gpu-runner
3 changes: 2 additions & 1 deletion .github/workflows/release-pypi-router.yml
@@ -69,9 +69,10 @@ jobs:
       with:
         path: sglang-repo

-      - name: Move rust folder to root and delete sglang-repo
+      - name: Move rust folder to root, copy the license file, and delete sglang-repo
         run: |
           mv sglang-repo/rust/* .
+          mv sglang-repo/LICENSE .
           rm -rf sglang-repo
           ls -alt
5 changes: 3 additions & 2 deletions README.md
@@ -16,6 +16,7 @@
 [**Join Bi-Weekly Development Meeting**](https://docs.google.com/document/d/1xEow4eIM152xNcRxqZz9VEcOiTQo8-CEuuQ5qTmkt-E/edit?usp=sharing) | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |

 ## News
+- [2024/12] 🔥 SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
 - [2024/10] 🔥 The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
 - [2024/09] SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
@@ -47,13 +48,13 @@ The core features include:
 - [Frontend: Structured Generation Language (SGLang)](https://sgl-project.github.io/frontend/frontend.html)

 ## Benchmark And Performance
-Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)
+Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)

 ## Roadmap
 [Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)

 ## Adoption and Sponsorship
-The project is supported by (alphabetically): AMD, Baseten, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, NVIDIA, RunPod, Stanford, UC Berkeley, xAI and 01.AI.
+The project is supported by (alphabetically): AMD, Baseten, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, Meituan, NVIDIA, RunPod, Stanford, UC Berkeley, xAI and 01.AI.

 ## Acknowledgment and Citation
 We learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
8 changes: 6 additions & 2 deletions benchmark/kernels/fused_moe_triton/README.md
@@ -10,7 +10,7 @@ Example usage:
 ```bash
 # Tune Qwen2-57B with FP8 and TP=4
 python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
-    --model Qwen/Qwen2-57B-A14B-Instruct-FP8 \
+    --model Qwen/Qwen2-57B-A14B-Instruct \
     --tp-size 4 \
     --dtype fp8_w8a8 \
     --tune
@@ -34,7 +34,7 @@
 # Compare with FP8 mode for Qwen2-57B
 python benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py \
-    --model Qwen/Qwen2-57B-A14B-Instruct-FP8 \
+    --model Qwen/Qwen2-57B-A14B-Instruct \
     --use-fp8

 # Compare with custom TP size
@@ -43,3 +43,7 @@
 python benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py \
 ```

 The benchmark results will be saved as plots and data files in the specified output directory (default: `./configs/benchmark_ops/vllm_sglang_fused_moe/`).
+
+- `benchmark_torch_compile_fused_moe.py`: A tool for benchmarking the fused MoE kernel compiled with `torch.compile` against the original fused MoE kernel.
+
+Usage is the same as `benchmark_vllm_vs_sglang_fused_moe_triton.py`.
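For concreteness, a hedged invocation sketch (the flags are assumed to mirror `benchmark_vllm_vs_sglang_fused_moe_triton.py`, as the added README line states):

```bash
# Benchmark the torch.compile fused MoE path against the Triton kernel
# (flags assumed identical to benchmark_vllm_vs_sglang_fused_moe_triton.py).
python benchmark/kernels/fused_moe_triton/benchmark_torch_compile_fused_moe.py \
    --model Qwen/Qwen2-57B-A14B-Instruct \
    --tp-size 4
```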
@@ -6,6 +6,7 @@
 from transformers import AutoConfig

 from sglang.srt.layers.fused_moe_triton.fused_moe import fused_moe as fused_moe_triton
+from sglang.srt.model_executor.cuda_graph_runner import set_torch_compile_config


 def get_model_config(model_name: str, tp_size: int):
@@ -64,7 +65,7 @@ def fused_topk_native(
     return topk_weights, topk_ids


-@torch.compile
+@torch.compile(dynamic=False)
 def fused_moe_torch(
     x,
     w1,
@@ -88,7 +89,8 @@
     w13_weights = w1[topk_ids]
     w1_weights, w3_weights = torch.chunk(w13_weights, 2, dim=2)
     w2_weights = w2[topk_ids]
-    x1 = F.gelu(torch.einsum("ti,taoi -> tao", x, w1_weights))
+    x1 = torch.einsum("ti,taoi -> tao", x, w1_weights)
+    x1 = F.silu(x1)
     x3 = torch.einsum("ti, taoi -> tao", x, w3_weights)
     expert_outs = torch.einsum("tao, taio -> tai", (x1 * x3), w2_weights)
     return torch.einsum("tai,ta -> ti", expert_outs, topk_weights.to(expert_outs.dtype))
@@ -174,6 +176,7 @@ def benchmark(batch_size, provider, model_config, use_fp8=False):
     print(f"benchmark {provider} with batch_size={batch_size}")
     torch.set_default_device("cuda")
     torch.cuda.manual_seed_all(0)
+    set_torch_compile_config()

     num_tokens = batch_size
     num_experts = model_config["num_experts"]
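A plausible reading of the `gelu` → `silu` change above (the diff itself does not state the motivation): it brings the eager reference in line with SwiGLU-style MoE experts, where each expert computes `silu(x @ w1) * (x @ w3)` before the down-projection. A minimal standalone sketch of that pattern, with a hypothetical helper name:

```python
import torch
import torch.nn.functional as F


def swiglu_expert_mlp(x, w1, w3, w2):
    """SwiGLU expert MLP: silu(x @ w1.T) * (x @ w3.T) @ w2.T.

    A sketch of the per-expert computation the eager reference mirrors;
    not code from the repository.
    """
    return (F.silu(x @ w1.t()) * (x @ w3.t())) @ w2.t()


# Example: one token, hidden size 8, intermediate size 16.
x = torch.randn(1, 8)
w1, w3 = torch.randn(16, 8), torch.randn(16, 8)
w2 = torch.randn(8, 16)
print(swiglu_expert_mlp(x, w1, w3, w2).shape)  # torch.Size([1, 8])
```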
6 changes: 3 additions & 3 deletions docker/Dockerfile.dev
@@ -50,11 +50,11 @@ RUN curl -L https://github.com/clangd/clangd/releases/download/18.1.3/clangd-lin
     && rm -rf clangd_18.1.3 clangd.zip

 # Install CMake
-RUN curl -L https://cmake.org/download/#:~:text=cmake%2D3.31.1%2Dlinux%2Dx86_64.tar.gz -o cmake.tar.gz \
-    && tar -xzf cmake.tar.gz \
+RUN wget https://github.com/Kitware/CMake/releases/download/v3.31.1/cmake-3.31.1-linux-x86_64.tar.gz \
+    && tar -xzf cmake-3.31.1-linux-x86_64.tar.gz \
     && cp -r cmake-3.31.1-linux-x86_64/bin/* /usr/local/bin/ \
     && cp -r cmake-3.31.1-linux-x86_64/share/* /usr/local/share/ \
-    && rm -rf cmake-3.31.1-linux-x86_64 cmake.tar.gz
+    && rm -rf cmake-3.31.1-linux-x86_64 cmake-3.31.1-linux-x86_64.tar.gz

 # Add yank script
 COPY --chown=root:root <<-"EOF" /usr/local/bin/yank
3 changes: 2 additions & 1 deletion docker/Dockerfile.rocm
@@ -1,5 +1,5 @@
 # Usage (to build SGLang ROCm docker image):
-#   docker build --build-arg SGL_BRANCH=v0.3.6.post3 -t v0.3.6.post3-rocm620 -f Dockerfile.rocm .
+#   docker build --build-arg SGL_BRANCH=v0.4.0.post1 -t v0.4.0.post1-rocm620 -f Dockerfile.rocm .

 # default base image
 ARG BASE_IMAGE="rocm/vllm-dev:20241022"
@@ -33,6 +33,7 @@ RUN python -m pip cache purge
 # Performance environment variable.

 ENV HIP_FORCE_DEV_KERNARG=1
+ENV SGLANG_SET_CPU_AFFINITY=1
 ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
 ENV NCCL_MIN_NCHANNELS=112
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -19,7 +19,7 @@ compile:
 		echo "Executing $$nb"; \
 		jupyter nbconvert --to notebook --execute --inplace "$$nb" \
 			--ExecutePreprocessor.timeout=600 \
-			--ExecutePreprocessor.kernel_name=python3; \
+			--ExecutePreprocessor.kernel_name=python3 || exit 1; \
 	fi; \
 	done
10 changes: 5 additions & 5 deletions docs/backend/native_api.ipynb
@@ -220,19 +220,19 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# failed update with different parameter size\n",
+    "# failed update with different parameter size or wrong name\n",
     "\n",
     "url = \"http://localhost:30010/update_weights_from_disk\"\n",
-    "data = {\"model_path\": \"meta-llama/Llama-3.2-3B\"}\n",
+    "data = {\"model_path\": \"meta-llama/Llama-3.2-1B-wrong\"}\n",
     "\n",
     "response = requests.post(url, json=data)\n",
     "response_json = response.json()\n",
     "print_highlight(response_json)\n",
     "assert response_json[\"success\"] is False\n",
     "assert response_json[\"message\"] == (\n",
-    "    \"Failed to update weights: The size of tensor a (2048) must match \"\n",
-    "    \"the size of tensor b (3072) at non-singleton dimension 1.\\n\"\n",
-    "    \"Rolling back to original weights.\"\n",
+    "    \"Failed to get weights iterator: \"\n",
+    "    \"meta-llama/Llama-3.2-1B-wrong\"\n",
+    "    \" (repository not found).\"\n",
     ")"
   ]
  },
4 changes: 2 additions & 2 deletions docs/developer/setup_github_runner.md
@@ -11,9 +11,9 @@ docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
 # Nvidia
 docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 /bin/bash
 # AMD
-docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.3.6.post3-rocm620 /bin/bash
+docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.0.post1-rocm620 /bin/bash
 # AMD just the last 2 GPUs
-docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.3.6.post3-rocm620 /bin/bash
+docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.0.post1-rocm620 /bin/bash
 ```

 ### Step 2: Configure the runner by `config.sh`
7 changes: 7 additions & 0 deletions docs/index.rst
@@ -39,6 +39,13 @@ The core features include:
    frontend/choices_methods.md


+.. toctree::
+   :maxdepth: 1
+   :caption: SGLang Router
+
+   router/router.md
+
+
 .. toctree::
    :maxdepth: 1
    :caption: References
4 changes: 4 additions & 0 deletions docs/references/contributor_guide.md
@@ -1,5 +1,9 @@
 # Contributor Guide

+## Build SGLang
+
+See the [Install SGLang, Method 2: From Source](../start/install.md) section.
+
 ## Format Your Code
 Use these commands to format your code and pass CI linting tests.
27 changes: 27 additions & 0 deletions docs/references/supported_models.md
@@ -80,3 +80,30 @@ To port a model from vLLM to SGLang, you can compare these two files [SGLang Lla
 - Remove `Sample`.
 - Change `forward()` functions, and add `forward_batch`.
 - Add `EntryClass` at the end.
+
+### Registering an external model implementation
+
+In addition to the methods described above, you can register your new model with the `ModelRegistry` before launching the server. This approach is useful if you want to integrate your model without modifying the source code.
+
+Here is how you can do it:
+
+```python
+from sglang.srt.models.registry import ModelRegistry
+from sglang.srt.server import launch_server
+
+# For a single model, add it to the registry directly.
+ModelRegistry.models[model_name] = model_class
+
+# For multiple models, imitate the import_model_classes() function
+# in sglang/srt/models/registry.py.
+from functools import lru_cache
+
+@lru_cache()
+def import_new_model_classes():
+    model_arch_name_to_cls = {}
+    ...
+    return model_arch_name_to_cls
+
+ModelRegistry.models.update(import_new_model_classes())
+
+launch_server(server_args)
+```
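A concrete end-to-end sketch of the single-model case (the external model class and checkpoint path below are hypothetical placeholders; `ServerArgs` is assumed to be importable from `sglang.srt.server_args`):

```python
from sglang.srt.models.registry import ModelRegistry
from sglang.srt.server import launch_server
from sglang.srt.server_args import ServerArgs

# Hypothetical external implementation; substitute your own model class.
from my_models.modeling_my_llama import MyLlamaForCausalLM

# Register under the architecture name reported in the checkpoint's config.json.
ModelRegistry.models["MyLlamaForCausalLM"] = MyLlamaForCausalLM

# Hypothetical local checkpoint path.
server_args = ServerArgs(model_path="/path/to/my-llama-checkpoint")
launch_server(server_args)
```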