Merge branch 'main' into ying-image-chunk
merrymercy authored Dec 9, 2024
2 parents 7deb312 + 3844feb commit 627e9bd
Showing 92 changed files with 4,972 additions and 1,346 deletions.
30 changes: 30 additions & 0 deletions .github/workflows/experiment-runner.yml
@@ -0,0 +1,30 @@
name: Experiment Runner

on:
  workflow_dispatch:
    inputs:
      script:
        description: "Experiment Runner Script"
        default: "configs/sharegpt_config.yaml"

concurrency:
  group: experiment-runner-${{ github.ref }}
  cancel-in-progress: true

jobs:
  experiment-runner-1-gpu:
    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
    runs-on: 1-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install dependencies
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Test experiment runner
        timeout-minutes: 120
        run: |
          cd test/srt
          python3 experiment_runner.py --config ${{ inputs.script }}
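Since the new workflow is exposed only through `workflow_dispatch`, it has to be triggered manually. A minimal sketch of kicking it off with the GitHub CLI (assuming `gh` is installed and authenticated; the `script` input and its default come from the workflow definition above):

```bash
# Manually dispatch the experiment runner with the default ShareGPT config.
gh workflow run experiment-runner.yml \
  --repo sgl-project/sglang \
  -f script=configs/sharegpt_config.yaml
```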
4 changes: 2 additions & 2 deletions .github/workflows/pr-test-rust.yml
@@ -42,7 +42,7 @@ jobs:
   e2e-rust:
     if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
-    runs-on: 1-gpu-runner
+    runs-on: 2-gpu-runner
     steps:
       - name: Checkout code
         uses: actions/checkout@v3
@@ -57,7 +57,7 @@
           cd rust
           pip install setuptools-rust wheel build
           python3 -m build
-          pip install dist/*.whl
+          pip install --force-reinstall dist/*.whl
       - name: Run e2e test
         run: |
           cd rust/py_test
6 changes: 6 additions & 0 deletions .github/workflows/pr-test.yml
@@ -105,6 +105,12 @@ jobs:
           cd test/srt
           python3 test_update_weights_from_distributed.py
+      - name: Evaluate MoE EP accuracy (TP=2)
+        timeout-minutes: 10
+        run: |
+          cd test/srt
+          python3 test_moe_ep.py
+
   performance-test-1-gpu-part-1:
     if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
     runs-on: 1-gpu-runner
3 changes: 2 additions & 1 deletion .github/workflows/release-pypi-router.yml
@@ -69,9 +69,10 @@ jobs:
       with:
         path: sglang-repo

-      - name: Move rust folder to root and delete sglang-repo
+      - name: Move rust folder to root, copy the license file, and delete sglang-repo
         run: |
           mv sglang-repo/rust/* .
+          mv sglang-repo/LICENSE .
           rm -rf sglang-repo
           ls -alt
5 changes: 3 additions & 2 deletions README.md
@@ -16,6 +16,7 @@
 [**Join Bi-Weekly Development Meeting**](https://docs.google.com/document/d/1xEow4eIM152xNcRxqZz9VEcOiTQo8-CEuuQ5qTmkt-E/edit?usp=sharing) | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |

 ## News
+- [2024/12] 🔥 SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
 - [2024/10] 🔥 The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
 - [2024/09] SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
@@ -47,13 +48,13 @@ The core features include:
 - [Frontend: Structured Generation Language (SGLang)](https://sgl-project.github.io/frontend/frontend.html)

 ## Benchmark And Performance
-Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)
+Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)

 ## Roadmap
 [Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)

 ## Adoption and Sponsorship
-The project is supported by (alphabetically): AMD, Baseten, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, NVIDIA, RunPod, Stanford, UC Berkeley, xAI and 01.AI.
+The project is supported by (alphabetically): AMD, Baseten, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, Meituan, NVIDIA, RunPod, Stanford, UC Berkeley, xAI and 01.AI.

 ## Acknowledgment and Citation
 We learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
8 changes: 6 additions & 2 deletions benchmark/kernels/fused_moe_triton/README.md
@@ -10,7 +10,7 @@ Example usage:
 ```bash
 # Tune Qwen2-57B with FP8 and TP=4
 python benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py \
-    --model Qwen/Qwen2-57B-A14B-Instruct-FP8 \
+    --model Qwen/Qwen2-57B-A14B-Instruct \
     --tp-size 4 \
     --dtype fp8_w8a8 \
     --tune
@@ -34,7 +34,7 @@
 # Compare with FP8 mode for Qwen2-57B
 python benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py \
-    --model Qwen/Qwen2-57B-A14B-Instruct-FP8 \
+    --model Qwen/Qwen2-57B-A14B-Instruct \
     --use-fp8

 # Compare with custom TP size
@@ -43,3 +43,7 @@
 python benchmark/kernels/fused_moe_triton/benchmark_vllm_vs_sglang_fused_moe_triton.py \
 ```

 The benchmark results will be saved as plots and data files in the specified output directory (default: `./configs/benchmark_ops/vllm_sglang_fused_moe/`).
+
+- `benchmark_torch_compile_fused_moe.py`: A tool for benchmarking the fused MoE kernel compiled with `torch.compile` against the original fused MoE kernel.
+
+Usage is the same as `benchmark_vllm_vs_sglang_fused_moe_triton.py`.
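For concreteness, a hedged invocation sketch (the flags are assumed to mirror `benchmark_vllm_vs_sglang_fused_moe_triton.py`, as the added README line states):

```bash
# Benchmark the torch.compile fused MoE path against the Triton kernel
# (flags assumed identical to benchmark_vllm_vs_sglang_fused_moe_triton.py).
python benchmark/kernels/fused_moe_triton/benchmark_torch_compile_fused_moe.py \
    --model Qwen/Qwen2-57B-A14B-Instruct \
    --tp-size 4
```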
@@ -6,6 +6,7 @@
 from transformers import AutoConfig

 from sglang.srt.layers.fused_moe_triton.fused_moe import fused_moe as fused_moe_triton
+from sglang.srt.model_executor.cuda_graph_runner import set_torch_compile_config


 def get_model_config(model_name: str, tp_size: int):
@@ -64,7 +65,7 @@ def fused_topk_native(
     return topk_weights, topk_ids


-@torch.compile
+@torch.compile(dynamic=False)
 def fused_moe_torch(
     x,
     w1,
@@ -88,7 +89,8 @@
     w13_weights = w1[topk_ids]
     w1_weights, w3_weights = torch.chunk(w13_weights, 2, dim=2)
     w2_weights = w2[topk_ids]
-    x1 = F.gelu(torch.einsum("ti,taoi -> tao", x, w1_weights))
+    x1 = torch.einsum("ti,taoi -> tao", x, w1_weights)
+    x1 = F.silu(x1)
     x3 = torch.einsum("ti, taoi -> tao", x, w3_weights)
     expert_outs = torch.einsum("tao, taio -> tai", (x1 * x3), w2_weights)
     return torch.einsum("tai,ta -> ti", expert_outs, topk_weights.to(expert_outs.dtype))
@@ -174,6 +176,7 @@ def benchmark(batch_size, provider, model_config, use_fp8=False):
     print(f"benchmark {provider} with batch_size={batch_size}")
     torch.set_default_device("cuda")
     torch.cuda.manual_seed_all(0)
+    set_torch_compile_config()

     num_tokens = batch_size
     num_experts = model_config["num_experts"]
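A plausible reading of the `gelu` → `silu` change above (the diff itself does not state the motivation): it brings the eager reference in line with SwiGLU-style MoE experts, where each expert computes `silu(x @ w1) * (x @ w3)` before the down-projection. A minimal standalone sketch of that pattern, with a hypothetical helper name:

```python
import torch
import torch.nn.functional as F


def swiglu_expert_mlp(x, w1, w3, w2):
    """SwiGLU expert MLP: silu(x @ w1.T) * (x @ w3.T) @ w2.T.

    A sketch of the per-expert computation the eager reference mirrors;
    not code from the repository.
    """
    return (F.silu(x @ w1.t()) * (x @ w3.t())) @ w2.t()


# Example: one token, hidden size 8, intermediate size 16.
x = torch.randn(1, 8)
w1, w3 = torch.randn(16, 8), torch.randn(16, 8)
w2 = torch.randn(8, 16)
print(swiglu_expert_mlp(x, w1, w3, w2).shape)  # torch.Size([1, 8])
```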
6 changes: 3 additions & 3 deletions docker/Dockerfile.dev
@@ -50,11 +50,11 @@ RUN curl -L https://github.com/clangd/clangd/releases/download/18.1.3/clangd-lin
     && rm -rf clangd_18.1.3 clangd.zip

 # Install CMake
-RUN curl -L https://cmake.org/download/#:~:text=cmake%2D3.31.1%2Dlinux%2Dx86_64.tar.gz -o cmake.tar.gz \
-    && tar -xzf cmake.tar.gz \
+RUN wget https://github.com/Kitware/CMake/releases/download/v3.31.1/cmake-3.31.1-linux-x86_64.tar.gz \
+    && tar -xzf cmake-3.31.1-linux-x86_64.tar.gz \
     && cp -r cmake-3.31.1-linux-x86_64/bin/* /usr/local/bin/ \
     && cp -r cmake-3.31.1-linux-x86_64/share/* /usr/local/share/ \
-    && rm -rf cmake-3.31.1-linux-x86_64 cmake.tar.gz
+    && rm -rf cmake-3.31.1-linux-x86_64 cmake-3.31.1-linux-x86_64.tar.gz

 # Add yank script
 COPY --chown=root:root <<-"EOF" /usr/local/bin/yank
3 changes: 2 additions & 1 deletion docker/Dockerfile.rocm
@@ -1,5 +1,5 @@
 # Usage (to build SGLang ROCm docker image):
-#   docker build --build-arg SGL_BRANCH=v0.3.6.post3 -t v0.3.6.post3-rocm620 -f Dockerfile.rocm .
+#   docker build --build-arg SGL_BRANCH=v0.4.0.post1 -t v0.4.0.post1-rocm620 -f Dockerfile.rocm .

 # default base image
 ARG BASE_IMAGE="rocm/vllm-dev:20241022"
@@ -33,6 +33,7 @@ RUN python -m pip cache purge
 # Performance environment variable.

 ENV HIP_FORCE_DEV_KERNARG=1
+ENV SGLANG_SET_CPU_AFFINITY=1
 ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
 ENV NCCL_MIN_NCHANNELS=112
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -19,7 +19,7 @@ compile:
 		echo "Executing $$nb"; \
 		jupyter nbconvert --to notebook --execute --inplace "$$nb" \
 			--ExecutePreprocessor.timeout=600 \
-			--ExecutePreprocessor.kernel_name=python3; \
+			--ExecutePreprocessor.kernel_name=python3 || exit 1; \
 	fi; \
 	done
10 changes: 5 additions & 5 deletions docs/backend/native_api.ipynb
@@ -220,19 +220,19 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# failed update with different parameter size\n",
+    "# failed update with different parameter size or wrong name\n",
     "\n",
     "url = \"http://localhost:30010/update_weights_from_disk\"\n",
-    "data = {\"model_path\": \"meta-llama/Llama-3.2-3B\"}\n",
+    "data = {\"model_path\": \"meta-llama/Llama-3.2-1B-wrong\"}\n",
     "\n",
     "response = requests.post(url, json=data)\n",
     "response_json = response.json()\n",
     "print_highlight(response_json)\n",
     "assert response_json[\"success\"] is False\n",
     "assert response_json[\"message\"] == (\n",
-    "    \"Failed to update weights: The size of tensor a (2048) must match \"\n",
-    "    \"the size of tensor b (3072) at non-singleton dimension 1.\\n\"\n",
-    "    \"Rolling back to original weights.\"\n",
+    "    \"Failed to get weights iterator: \"\n",
+    "    \"meta-llama/Llama-3.2-1B-wrong\"\n",
+    "    \" (repository not found).\"\n",
     ")"
   ]
  },
4 changes: 2 additions & 2 deletions docs/developer/setup_github_runner.md
@@ -11,9 +11,9 @@ docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
 # Nvidia
 docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 /bin/bash
 # AMD
-docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.3.6.post3-rocm620 /bin/bash
+docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.0.post1-rocm620 /bin/bash
 # AMD just the last 2 GPUs
-docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.3.6.post3-rocm620 /bin/bash
+docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.4.0.post1-rocm620 /bin/bash
 ```

 ### Step 2: Configure the runner by `config.sh`
7 changes: 7 additions & 0 deletions docs/index.rst
@@ -39,6 +39,13 @@ The core features include:
    frontend/choices_methods.md


+.. toctree::
+   :maxdepth: 1
+   :caption: SGLang Router
+
+   router/router.md
+
+
 .. toctree::
    :maxdepth: 1
    :caption: References
4 changes: 4 additions & 0 deletions docs/references/contributor_guide.md
@@ -1,5 +1,9 @@
 # Contributor Guide

+## Build SGLang
+
+See the [Install SGLang, Method 2: From Source](../start/install.md) section.
+
 ## Format Your Code
 Use these commands to format your code and pass CI linting tests.
27 changes: 27 additions & 0 deletions docs/references/supported_models.md
@@ -80,3 +80,30 @@ To port a model from vLLM to SGLang, you can compare these two files [SGLang Lla
 - Remove `Sample`.
 - Change `forward()` functions, and add `forward_batch`.
 - Add `EntryClass` at the end.
+
+### Registering an external model implementation
+
+In addition to the methods described above, you can register your new model with the `ModelRegistry` before launching the server. This approach is useful if you want to integrate your model without modifying the source code.
+
+Here is how you can do it:
+
+```python
+from sglang.srt.models.registry import ModelRegistry
+from sglang.srt.server import launch_server
+
+# For a single model, add it to the registry directly.
+ModelRegistry.models[model_name] = model_class
+
+# For multiple models, imitate the import_model_classes() function
+# in sglang/srt/models/registry.py.
+from functools import lru_cache
+
+@lru_cache()
+def import_new_model_classes():
+    model_arch_name_to_cls = {}
+    ...
+    return model_arch_name_to_cls
+
+ModelRegistry.models.update(import_new_model_classes())
+
+launch_server(server_args)
+```
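A concrete end-to-end sketch of the single-model case (the external model class and checkpoint path below are hypothetical placeholders; `ServerArgs` is assumed to be importable from `sglang.srt.server_args`):

```python
from sglang.srt.models.registry import ModelRegistry
from sglang.srt.server import launch_server
from sglang.srt.server_args import ServerArgs

# Hypothetical external implementation; substitute your own model class.
from my_models.modeling_my_llama import MyLlamaForCausalLM

# Register under the architecture name reported in the checkpoint's config.json.
ModelRegistry.models["MyLlamaForCausalLM"] = MyLlamaForCausalLM

# Hypothetical local checkpoint path.
server_args = ServerArgs(model_path="/path/to/my-llama-checkpoint")
launch_server(server_args)
```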