Corrupted outputs with Marlin int4 kernels as parallelization increases #332
Comments
Could be related, but I noticed that the latest release of optimum-quanto (v0.2.5) corrupts transformer weights during qfloat8 quantization. Downgrading to 0.2.4 solved the issue. I'm not sure what the exact cause is, but I will look into it. Code that caused corruption in 0.2.5 but not in earlier versions:

pipe = FluxPipeline.from_pretrained(...)
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
quantize(pipe.text_encoder, weights=qfloat8)
freeze(pipe.text_encoder)
quantize(pipe.text_encoder_2, weights=qfloat8)
freeze(pipe.text_encoder_2)
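For reference, a more complete, self-contained version of that snippet might look like the following; the model id, dtype, device placement and prompt are assumptions for illustration, not details from the original report.

# Hypothetical fleshed-out version of the snippet above; the model id, dtype,
# device and prompt are assumptions, not taken from the original report.
import torch
from diffusers import FluxPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)

# Quantize and freeze each sub-model's weights to float8.
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)
quantize(pipe.text_encoder, weights=qfloat8)
freeze(pipe.text_encoder)
quantize(pipe.text_encoder_2, weights=qfloat8)
freeze(pipe.text_encoder_2)

pipe.to("cuda")
image = pipe("a photo of a cat", num_inference_steps=4).images[0]
image.save("out.png")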
Yeah, same here. I was confused at first because the generated image was just pure noise, so I downgraded to this version https://github.com/huggingface/optimum-quanto.git@65ace79d6af6ccc27afbb3576541cc36b3e3a98b and it worked fine. (This was the 0.25.0.dev0 version.)
@inarikami @Leommm-byte this cannot be related, as the new Marlin kernel is only used for int4 weights.
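As an illustration of that point (a minimal sketch, not code from the thread): in optimum-quanto only int4-quantized weights can be dispatched to the optimized Marlin gemm on CUDA, so a qfloat8 workflow like the one above never exercises it.

# Sketch contrasting the two weight qtypes; the layer shape is an assumption.
# Only the qint4 variant can end up on the optimized Marlin int4 kernel.
import torch
from optimum.quanto import freeze, qfloat8, qint4, quantize

fp8_model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).cuda()
quantize(fp8_model, weights=qfloat8)  # float8 weights: Marlin is never used
freeze(fp8_model)

int4_model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).cuda()
quantize(int4_model, weights=qint4)   # int4 weights: may use the Marlin gemm
freeze(int4_model)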
This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
I just quickly ran this through Compute Sanitizer, and it seems like there's a race condition in the Marlin kernel.
I'll dig deeper when I can find a spare few hours. :)
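For anyone who wants to repeat that check, here is a rough sketch of the kind of reproducer that can be run under Compute Sanitizer's racecheck tool; the layer shape, dtype and batch size are assumptions, not the exact configuration used above.

# repro_marlin.py: rough sketch of a reproducer to run under Compute Sanitizer,
# e.g.  compute-sanitizer --tool racecheck python repro_marlin.py
# Shapes, dtype and batch size are assumptions, not taken from the thread.
import torch
from optimum.quanto import freeze, qint4, quantize

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096, bias=False))
model = model.cuda().half()
quantize(model, weights=qint4)
freeze(model)

# Larger batches increase parallelization, which is when the corruption shows up.
x = torch.randn(256, 4096, device="cuda", dtype=torch.float16)
with torch.no_grad():
    y = model(x)
torch.cuda.synchronize()
print(y.shape, y.isnan().any().item())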
@ahadnagy thank you for your feedback; ping me if you need any help.
So far I've found two possible race conditions. The second one appears at relatively small sizes as well.
@dacorvo Could these be related to the memory issues you suspected? I'm not sure whether these are false positives (I've never seen a false positive from this tool so far); I'll have to dig into the indexing and pipelining to make sense of it, as this kernel is quite involved. But first, I'm gonna send identities through in a more minimal reproducer and hopefully a pattern will emerge.
Race condition confirmed. I tested the kernel with identity weights, and the output is randomly messed up. I'm not sure how long it's gonna take to debug this. Maybe it's worth submitting an issue to the Marlin repo?
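For context, a sketch of that identity-weight check (shapes and tolerance are my assumptions): an identity matrix survives int4 quantization essentially unchanged, so the quantized linear should return its input, and any large mismatch points at the kernel rather than at quantization error.

# Identity-weight sanity check (sketch). With W = I the quantized linear should
# reproduce its input almost exactly, so large mismatches indicate a kernel bug
# rather than quantization noise. Shapes and tolerance are assumptions.
import torch
from optimum.quanto import freeze, qint4, quantize

features, batch = 4096, 256
model = torch.nn.Sequential(torch.nn.Linear(features, features, bias=False))
with torch.no_grad():
    model[0].weight.copy_(torch.eye(features))
model = model.cuda().half()

quantize(model, weights=qint4)
freeze(model)

x = torch.randn(batch, features, device="cuda", dtype=torch.float16)
with torch.no_grad():
    y = model(x)

print("max abs error vs identity:", (y - x).abs().max().item())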
@ahadnagy thank you for investigating this. This is not the original Marlin kernel: it has been modified to integrate a shift in addition to the scale.
I'll take a look!
I did the same test on the original Marlin kernel and it exhibits the same behaviour. Edit: anything above …
@ahadnagy thank you for your investigations. I think that at this stage it may be worth creating an issue in the vLLM repository, where the Marlin kernels are now maintained, although I don't know how much time it would require to create a reproducible example using that version of the kernels: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/gptq_marlin/gptq_marlin.cu.
@dacorvo I'm happy to do it! I'll make an attempt to reproduce this in the vLLM version today. Hopefully it won't get too involved. As far as debugging the kernel goes, there's no easy way to debug this kind of issue, unfortunately.
In some cases, Nsight Compute's uncoalesced memory access and bank conflict metrics can also give a hint on top of what Compute Sanitizer finds.
"Good" news, I was able to reproduce this in the vLLM version as well: It seems like only the It fails for the same shapes, has the same race conditions reported, so everything's in order to submit an issue. I'll do that tomorrow. |
… but it will lead to KeyError: 'time_text_embed.timestep_embedder.linear_1.base_layer.weight_qtype'.
The race condition in the GPTQ Marlin kernel has been fixed: vllm-project/vllm#11493.
I tried the changes in this kernel version yesterday, and it appears to be working. I'll prepare a PR with the changes. Edit: one thing that could be a showstopper is that the fix increases the required shared memory size, which raises the minimum compute capability to 8.0.
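If the fix does raise the minimum compute capability to 8.0, a guard along these lines (a sketch, not the actual optimum-quanto dispatch code) could keep older GPUs on the fallback path:

# Sketch of a capability gate; this is NOT the real optimum-quanto dispatch
# logic, just an illustration of enforcing the reported sm_80 requirement.
import torch

def can_use_marlin_int4(device: torch.device) -> bool:
    if device.type != "cuda":
        return False
    major, minor = torch.cuda.get_device_capability(device)
    # The fixed kernel reportedly needs more shared memory, hence sm_80+.
    return (major, minor) >= (8, 0)

if torch.cuda.is_available():
    print(can_use_marlin_int4(torch.device("cuda:0")))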
When using MarlinInt4WeightQBitsTensor and its associated optimized gemm kernel, there are issues with the weight/scales/zero-point readback as soon as parallelization increases.
The consequence is that output features above 128 are corrupted once a sufficient number of inputs are processed in parallel.
Test to reproduce the issue: optimum-quanto/test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py, line 134 (at commit 852bb9c).
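The upstream test referenced above is the authoritative reproduction; as a rough standalone illustration of the symptom (shapes, dtypes and batch sizes here are my own assumptions), one can compare the optimized int4 path against a float reference while increasing the batch size and watch the error grow only for output features beyond 128:

# Standalone illustration (not the project's test): compare the int4-quantized
# linear against a float16 reference for growing batch sizes and report the
# error separately for output features below and above 128.
# Shapes and dtypes are assumptions.
import copy
import torch
from optimum.quanto import freeze, qint4, quantize

ref = torch.nn.Sequential(torch.nn.Linear(4096, 4096, bias=False)).cuda().half()
qmodel = copy.deepcopy(ref)
quantize(qmodel, weights=qint4)
freeze(qmodel)

for batch in (1, 16, 64, 256):
    x = torch.randn(batch, 4096, device="cuda", dtype=torch.float16)
    with torch.no_grad():
        expected, actual = ref(x), qmodel(x)
    err = (actual - expected).abs()
    err_low = err[:, :128].max().item()
    err_high = err[:, 128:].max().item()
    print(f"batch={batch:4d}  max_err[:128]={err_low:.3f}  max_err[128:]={err_high:.3f}")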