Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: resolve fp8 moe issue #2387

Merged
merged 1 commit into from
Dec 7, 2024
Merged

fix: resolve fp8 moe issue #2387

merged 1 commit into from
Dec 7, 2024

Conversation

zhyncs
Copy link
Member

@zhyncs zhyncs commented Dec 7, 2024

Motivation

fix #2386 #2370 #2366
cc @BBuf @HaiShaw

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs zhyncs added the bug Something isn't working label Dec 7, 2024
@HaiShaw
Copy link
Collaborator

HaiShaw commented Dec 7, 2024

@zhyncs wondered if you can wait a bit, I have a PR coming

@zhyncs
Copy link
Member Author

zhyncs commented Dec 7, 2024

@HaiShaw May you build on top of this PR? This is an urgent fix for the main branch.

@HaiShaw
Copy link
Collaborator

HaiShaw commented Dec 7, 2024

@HaiShaw May you build on top of this PR? This is an urgent fix for the main branch.

Sounds good, I will just need to shrink down my diff.

@zhyncs
Copy link
Member Author

zhyncs commented Dec 7, 2024

FYI All 2-GPU unit tests, performance tests, and accuracy tests failed due to the machine itself running out of memory. I manually executed them in the local development environment without any issues. Please ignore.

@@ -319,7 +316,25 @@ class Fp8MoEMethod(FusedMoEMethodBase):
quant_config: The quantization config.
"""

def __init__(self, quant_config: Fp8Config):
def __new__(cls, *args, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need __new__?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FusedMoEMethodBase needs to be inherited, but directly writing it as an import will cause circular dependencies. Currently, a dynamic approach is used to avoid this issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most other changes are what I spotted too, just __new__ doesn't seem to be necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__new__ is used here because we need to modify the class inheritance before instance creation. It's the only method that runs before __init__ and allows us to control how the instance is created, letting us break the circular import by setting up inheritance at runtime rather than import time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use apply in fp8.py, and remove apply setting in __init__.py, should be simply ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref #2386

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, let me take a look, my side of ROCm tests has got no complain, so worthy a check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python3 -c "from sglang.srt.layers.fused_moe_triton.fused_moe import fused_moe"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it too, only used in benchmark scripts, so we will fix it, let me continue it tomorrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! ref #2387 (comment)

@HaiShaw
Copy link
Collaborator

HaiShaw commented Dec 7, 2024

FYI All 2-GPU unit tests, performance tests, and accuracy tests failed due to the machine itself running out of memory. I manually executed them in the local development environment without any issues. Please ignore.

Ignore - mean the failed cases?

@zhyncs
Copy link
Member Author

zhyncs commented Dec 7, 2024

FYI All 2-GPU unit tests, performance tests, and accuracy tests failed due to the machine itself running out of memory. I manually executed them in the local development environment without any issues. Please ignore.

Ignore - mean the failed cases?

yes

@zhyncs
Copy link
Member Author

zhyncs commented Dec 7, 2024

Currently, nightly gsm8k and the following gpu-2 have been locally verified. It seems to be an issue with the GPU runner. @merrymercy Please help fix the GPU runner issue. Thanks.

https://github.com/sgl-project/sglang/actions/runs/12212009601/job/34069970746

  unit-test-backend-2-gpu:
    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
    runs-on: 2-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install dependencies
        env:
          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu121/torch2.4/' || 'https://flashinfer.ai/whl/cu121/torch2.4/' }}
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Evaluate data parallelism accuracy (DP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_data_parallelism.py

      - name: Evaluate MLA accuracy (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_mla.py
          python3 test_mla_fp8.py
          python3 test_dp_attention.py

      - name: Test update weights from distributed
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_update_weights_from_distributed.py

      - name: Evaluate MoE EP accuracy (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 test_moe_ep.py

  performance-test-2-gpu:
    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
    runs-on: 2-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install dependencies
        env:
          FLASHINFER_REPO: ${{ inputs.version == 'nightly' && 'https://flashinfer.ai/whl/nightly/cu121/torch2.4/' || 'https://flashinfer.ai/whl/cu121/torch2.4/' }}
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Benchmark single latency (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_default

      - name: Benchmark offline throughput (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_default

      - name: Benchmark offline throughput (w/o RadixAttention) (TP=2)
        timeout-minutes: 10
        run: |
          cd test/srt
          python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_without_radix_cache

@zhyncs
Copy link
Member Author

zhyncs commented Dec 7, 2024

bash test.sh ✅

#!/bin/bash
set -ex
python3 test_data_parallelism.py
python3 test_mla.py
python3 test_mla_fp8.py
python3 test_dp_attention.py
python3 test_update_weights_from_distributed.py
python3 test_moe_ep.py
python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_default
python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_default
python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_without_radix_cache

@zhyncs zhyncs merged commit d332aa3 into main Dec 7, 2024
16 of 18 checks passed
@zhyncs zhyncs deleted the zhyncs/fix branch December 7, 2024 11:28
@zhyncs
Copy link
Member Author

zhyncs commented Dec 7, 2024

Let me further explain the fix and design intention of this PR.

  • It dynamically creates a class inheriting from FusedMoEMethodBase only on first instantiation
  • Subsequent instantiations follow the normal path
  • Maintains proper inheritance relationships, allowing Fp8EPMoEMethod to correctly inherit all functionality

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] circular import error in fused_moe_triton
2 participants