fix: resolve fp8 moe issue #2387

zhyncs · 2024-12-07T10:07:25Z

Motivation

fix #2386 #2370 #2366
cc @BBuf @HaiShaw

Modifications

Checklist

Format your code according to the Contributor Guide.
Add unit tests as outlined in the Contributor Guide.
Update documentation as needed, including docstrings or example tutorials.

HaiShaw · 2024-12-07T10:12:59Z

@zhyncs I was wondering if you could wait a bit; I have a PR coming.

zhyncs · 2024-12-07T10:14:05Z

@HaiShaw Could you build on top of this PR? This is an urgent fix for the main branch.

HaiShaw · 2024-12-07T10:21:11Z

> @HaiShaw Could you build on top of this PR? This is an urgent fix for the main branch.

Sounds good, I will just need to shrink down my diff.

zhyncs · 2024-12-07T10:23:55Z

FYI: all 2-GPU unit tests, performance tests, and accuracy tests failed because the CI machine itself ran out of memory. I ran them manually in my local development environment without any issues. Please ignore the failures.

HaiShaw · 2024-12-07T10:24:29Z

python/sglang/srt/layers/quantization/fp8.py

@@ -319,7 +316,25 @@ class Fp8MoEMethod(FusedMoEMethodBase):
        quant_config: The quantization config.
    """

-    def __init__(self, quant_config: Fp8Config):
+    def __new__(cls, *args, **kwargs):


Why is __new__ needed?

Fp8MoEMethod needs to inherit from FusedMoEMethodBase, but importing it directly at module level causes a circular import. For now, a dynamic approach is used to avoid this issue.
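To make the failure mode concrete, here is a minimal, self-contained sketch of a circular import and of the deferred-import workaround. The module names demo_moe and demo_fp8 and the helper try_import are hypothetical stand-ins written to a temporary directory; they are not sglang modules, and the real import graph is more involved.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: demo_moe imports demo_fp8 before defining Base,
# while demo_fp8 needs Base from demo_moe at import time.
EAGER = {
    "demo_moe": "from demo_fp8 import Fp8Method\n\nclass Base:\n    pass\n",
    "demo_fp8": "from demo_moe import Base\n\nclass Fp8Method(Base):\n    pass\n",
}

# Same layout, but demo_fp8 defers the import until a method is called.
LAZY = {
    "demo_moe": EAGER["demo_moe"],
    "demo_fp8": (
        "class Fp8Method:\n"
        "    def base(self):\n"
        "        from demo_moe import Base  # resolved at call time\n"
        "        return Base\n"
    ),
}


def try_import(files, entry):
    """Write module sources to a temp dir, import `entry`, and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, source in files.items():
            Path(tmp, f"{name}.py").write_text(source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(entry)
            return "ok"
        except ImportError as exc:
            return f"ImportError: {exc}"
        finally:
            # Undo all side effects so both variants start from a clean state.
            sys.path.remove(tmp)
            for name in files:
                sys.modules.pop(name, None)


eager_result = try_import(EAGER, "demo_moe")
lazy_result = try_import(LAZY, "demo_moe")
print(eager_result)
print(lazy_result)
```

The eager variant fails because demo_fp8 asks for Base while demo_moe is still only partially initialized; the lazy variant imports cleanly because nothing is resolved until base() is called.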

Most of the other changes match what I spotted too; it's just that __new__ doesn't seem to be necessary?

__new__ is used here because we need to modify the class inheritance before instance creation. It runs before __init__ and lets us control how the instance is created, so we can break the circular import by setting up inheritance at runtime rather than at import time.
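A minimal sketch of the mechanism described above, under stated assumptions: LazyBase, load_base, and Fp8LikeMethod are hypothetical names, and load_base() stands in for a function-local import of the real base class. This illustrates the general technique and is not the code from this PR.

```python
class LazyBase:
    """Stand-in for a base class that cannot be imported at module load time."""

    def describe(self):
        return f"{type(self).__name__} uses LazyBase"


def load_base():
    # In a real package this would be a function-local import, executed
    # at call time instead of at module import time.
    return LazyBase


class Fp8LikeMethod:
    """Declared without a base; the base is attached when an instance is created."""

    def __new__(cls, *args, **kwargs):
        base = load_base()
        # Copy the class body, skipping per-class descriptors and __new__ itself.
        namespace = {
            key: value
            for key, value in cls.__dict__.items()
            if key not in ("__dict__", "__weakref__", "__new__")
        }
        dynamic_cls = type(cls.__name__, (base,), namespace)
        obj = object.__new__(dynamic_cls)
        # Python skips the automatic __init__ call because obj is not an
        # instance of cls, so it has to be called explicitly.
        obj.__init__(*args, **kwargs)
        return obj

    def __init__(self, quant_config):
        self.quant_config = quant_config


method = Fp8LikeMethod("fp8-config")
print(method.describe())
```

One trade-off of this pattern is that the returned object is an instance of the dynamically built class, not of the class named in the source, so isinstance checks against the original class return False.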

If we use apply in fp8.py and remove the apply setting in __init__.py, shouldn't that simply work?

Thanks, let me take a look. The ROCm tests on my side raised no complaints, so it is worth a check.

python3 -c "from sglang.srt.layers.fused_moe_triton.fused_moe import fused_moe"

I see it too. It is only used in benchmark scripts, so we will fix it; let me continue tomorrow.

Thanks! ref #2387 (comment)

HaiShaw · 2024-12-07T10:26:11Z

> FYI: all 2-GPU unit tests, performance tests, and accuracy tests failed because the CI machine itself ran out of memory. I ran them manually in my local development environment without any issues. Please ignore the failures.

Ignore, meaning the failed cases?

zhyncs · 2024-12-07T10:26:46Z

> > FYI: all 2-GPU unit tests, performance tests, and accuracy tests failed because the CI machine itself ran out of memory. I ran them manually in my local development environment without any issues. Please ignore the failures.
>
> Ignore, meaning the failed cases?

Yes.

zhyncs · 2024-12-07T11:09:24Z

So far, the nightly gsm8k test and the 2-GPU jobs below have been verified locally. This looks like an issue with the GPU runner. @merrymercy Please help fix the GPU runner issue. Thanks.

https://github.com/sgl-project/sglang/actions/runs/12212009601/job/34069970746

  unit-test-backend-2-gpu:
    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
    runs-on: 2-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install dependencies
        env:
          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu121/torch2.4/' || 'https://flashinfer.ai/whl/cu121/torch2.4/' }}
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Evaluate data parallelism accuracy (DP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_data_parallelism.py

      - name: Evaluate MLA accuracy (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_mla.py
          python3 test_mla_fp8.py
          python3 test_dp_attention.py

      - name: Test update weights from distributed
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_update_weights_from_distributed.py

      - name: Evaluate MoE EP accuracy (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_moe_ep.py

  performance-test-2-gpu:
    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
    runs-on: 2-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install dependencies
        env:
          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu121/torch2.4/' || 'https://flashinfer.ai/whl/cu121/torch2.4/' }}
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Benchmark single latency (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_default

      - name: Benchmark offline throughput (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_default

      - name: Benchmark offline throughput (w/o RadixAttention) (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_without_radix_cache

zhyncs · 2024-12-07T11:28:09Z

bash test.sh ✅

#!/bin/bash
set -ex
python3 test_data_parallelism.py
python3 test_mla.py
python3 test_mla_fp8.py
python3 test_dp_attention.py
python3 test_update_weights_from_distributed.py
python3 test_moe_ep.py
python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_default
python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_default
python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_without_radix_cache

zhyncs · 2024-12-07T11:41:31Z

Let me further explain the fix and the design intent of this PR:

- It dynamically creates a class inheriting from FusedMoEMethodBase only on first instantiation.
- Subsequent instantiations follow the normal path.
- It maintains proper inheritance relationships, allowing Fp8EPMoEMethod to correctly inherit all functionality.

fix: resolve fp8 moe issue

253d447

zhyncs added the bug Something isn't working label Dec 7, 2024

zhyncs requested review from merrymercy, Ying1123 and ispobock as code owners December 7, 2024 10:07

zhyncs mentioned this pull request Dec 7, 2024

[Bug] circular import error in fused_moe_triton #2386

Closed

zhyncs added the high priority label Dec 7, 2024

HaiShaw reviewed Dec 7, 2024

zhyncs merged commit d332aa3 into main Dec 7, 2024
16 of 18 checks passed

zhyncs deleted the zhyncs/fix branch December 7, 2024 11:28
