fix: resolve fp8 moe issue #2387
Conversation
@zhyncs I wondered if you could wait a bit; I have a PR coming.
@HaiShaw Could you build on top of this PR? This is an urgent fix for the main branch.
Sounds good, I will just need to shrink down my diff.
FYI: all the 2-GPU unit tests, performance tests, and accuracy tests failed because the machine itself ran out of memory. I ran them manually in my local development environment without any issues, so please ignore those failures.
@@ -319,7 +316,25 @@ class Fp8MoEMethod(FusedMoEMethodBase):
        quant_config: The quantization config.
    """

    def __init__(self, quant_config: Fp8Config):
    def __new__(cls, *args, **kwargs):
Why do we need __new__?
FusedMoEMethodBase needs to be inherited, but importing it directly causes a circular dependency, so a dynamic approach is used to set up the inheritance and avoid the issue.
Most of the other changes are what I spotted too; it's just that __new__ doesn't seem to be necessary?
__new__ is used here because we need to modify the class inheritance before instance creation. It's the only method that runs before __init__ and allows us to control how the instance is created, letting us break the circular import by setting up the inheritance at runtime rather than at import time.
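A minimal, self-contained sketch of that pattern (class names are reused for illustration only, this is not the actual sglang implementation, and the deferred import is simulated by defining the base class in the same snippet):

class FusedMoEMethodBase:
    # Pretend this class lives in another module that imports fp8.py back,
    # which is what creates the circular dependency.
    def apply(self, *args, **kwargs):
        raise NotImplementedError


class Fp8MoEMethod:
    """The base class is attached when the first instance is created, not at
    import time, so importing this module never triggers the cycle."""

    _runtime_cls = None  # cache for the dynamically built subclass

    def __new__(cls, *args, **kwargs):
        # In the real code this would be a deferred import of the base class,
        # done here inside __new__ instead of at module top level.
        base = FusedMoEMethodBase

        if cls._runtime_cls is None:
            # Build a class that actually inherits from the base, copying over
            # everything defined on cls (methods, docstring, ...).
            namespace = {
                k: v
                for k, v in cls.__dict__.items()
                if k not in ("__dict__", "__weakref__")
            }
            cls._runtime_cls = type(cls.__name__, (base,), namespace)

        obj = object.__new__(cls._runtime_cls)
        # __new__ returns an instance of a different class, so Python will not
        # call __init__ automatically; do it explicitly.
        obj.__init__(*args, **kwargs)
        return obj

    def __init__(self, quant_config):
        self.quant_config = quant_config


method = Fp8MoEMethod({"dtype": "fp8_e4m3"})
print(isinstance(method, FusedMoEMethodBase))  # True: inheritance attached at runtime

The trade-off is that the inheritance is invisible to static analysis and only materializes on first instantiation, which is why the reviewer asks below whether a simpler arrangement would avoid it.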
If we use apply in fp8.py and remove the apply setting in __init__.py, should that simply be OK?
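A hedged sketch of what this alternative might look like (the apply body and its pass-through signature are assumptions, not the actual sglang code; only the import path is taken from the command quoted later in this thread):

# fp8.py (sketch): apply is defined here directly instead of being assigned
# from the package __init__.py, and the kernel import is deferred into the
# method body so loading this module cannot participate in an import cycle.
class Fp8MoEMethod:
    def __init__(self, quant_config):
        self.quant_config = quant_config

    def apply(self, *args, **kwargs):
        # Deferred import: resolved on the first call, not when fp8.py is imported.
        from sglang.srt.layers.fused_moe_triton.fused_moe import fused_moe

        return fused_moe(*args, **kwargs)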
ref #2386
Thanks, let me take a look. My ROCm tests show no complaints on my side, so it's worth a check.
python3 -c "from sglang.srt.layers.fused_moe_triton.fused_moe import fused_moe"
I see it too. It's only used in benchmark scripts, so we will fix it; let me continue with it tomorrow.
Thanks! ref #2387 (comment)
By "ignore", do you mean the failed cases?
Yes.
Currently, the nightly gsm8k tests and the following gpu-2 tests have been verified locally, so it seems to be an issue with the GPU runner itself. @merrymercy please help fix the GPU runner issue. Thanks. https://github.com/sgl-project/sglang/actions/runs/12212009601/job/34069970746
bash test.sh ✅

#!/bin/bash
set -ex
python3 test_data_parallelism.py
python3 test_mla.py
python3 test_mla_fp8.py
python3 test_dp_attention.py
python3 test_update_weights_from_distributed.py
python3 test_moe_ep.py
python3 -m unittest test_bench_one_batch.TestBenchOneBatch.test_moe_default
python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_default
python3 -m unittest test_bench_serving.TestBenchServing.test_moe_offline_throughput_without_radix_cache
Let me further explain the fix and the design intention of this PR.
Motivation
Fix #2386, #2370, #2366
cc @BBuf @HaiShaw
Modifications
Checklist