Corrupted outputs with Marlin int4 kernels as parallelization increases #332

Open
dacorvo opened this issue Oct 6, 2024 · 18 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@dacorvo

dacorvo commented Oct 6, 2024

When using MarlinInt4WeightQBitsTensor and its associated optimized gemm kernel, the readback of weights/scales/zero-points becomes unreliable as soon as parallelization increases.

As a consequence, output features beyond the first 128 are corrupted once a sufficient number of inputs are processed in parallel.

Test to reproduce the issue here:

@pytest.mark.xfail(reason="Bug in Marlin kernel", strict=False)
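Conceptually, the test compares the optimized kernel output against a dequantized reference and looks for corrupted output columns. A minimal sketch of that check, with marlin_linear as a hypothetical stand-in for the optimized call:

import torch

def corrupted_columns(marlin_linear, inputs, dequantized_weight, atol=1e-2):
    # marlin_linear: hypothetical callable wrapping the optimized int4 gemm
    out = marlin_linear(inputs)                   # optimized kernel output
    ref = inputs @ dequantized_weight.t()         # reference matmul on dequantized weights
    bad = (out - ref).abs().gt(atol).any(dim=0)   # flag each corrupted output feature
    # With enough parallel inputs, columns beyond the first 128 start to differ.
    return bad.nonzero().flatten()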

dacorvo added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Oct 6, 2024
@inarikami

inarikami commented Oct 12, 2024

This could be related, but I noticed that the latest release of optimum-quanto (v0.2.5) corrupts transformer weights during qfloat8 quantization. Downgrading to 0.2.4 solved the issue. I'm not sure what the exact cause is, but I will look into it.

Code that caused corruption in 0.2.5 but not earlier versions:

from diffusers import FluxPipeline
from optimum.quanto import quantize, freeze, qfloat8

pipe = FluxPipeline.from_pretrained(...)  # checkpoint and loading options elided in the original report

# Quantize the weights of each sub-model to qfloat8, then freeze them
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

quantize(pipe.text_encoder, weights=qfloat8)
freeze(pipe.text_encoder)

quantize(pipe.text_encoder_2, weights=qfloat8)
freeze(pipe.text_encoder_2)

@Leommm-byte

> Could be related, but I noticed the latest release of optimum-quanto (v0.2.5) corrupts transformer weights during qfloat8 quantization. Downgrading to 0.2.4 solved this issue. [same code snippet as above]

Yeah, same here. I was confused at first because the generated image was just pure noise, so I downgraded to this revision:

https://github.com/huggingface/optimum-quanto.git@65ace79d6af6ccc27afbb3576541cc36b3e3a98b

and it worked fine. (This was the 0.25.0.dev0.)

@dacorvo

dacorvo commented Oct 14, 2024

@inarikami @Leommm-byte this cannot be related, as the new Marlin kernel is only available for qint4 and is not used by default.
If you have a script that reproduces the corruption on 0.2.5, feel free to open a separate issue.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Nov 14, 2024
dacorvo removed the Stale label on Nov 14, 2024
@ahadnagy

ahadnagy commented Dec 11, 2024

I just quickly ran this through Compute Sanitizer, and it seems like there's a race condition in the Marlin kernel:

compute-sanitizer --tool=racecheck --target-processes=all pytest [REDACTED]//repos/optimum-quanto/test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py::test_marlin_int4_weight_qbits_tensor_linear_failing
========= COMPUTE-SANITIZER
=================================================================================================================================================== test session starts ===================================================================================================================================================
platform linux -- Python 3.12.7, pytest-7.4.4, pluggy-1.5.0
rootdir: [REDACTED]//repos/optimum-quanto
collected 16 items                                                                                                                                                                                                                                                                                                        

test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py ========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4eb0
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x51c0 [59520 hazards]
========= 
========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4e80
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x51b0 [58752 hazards]
========= 
========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4e40
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x51a0 [57984 hazards]
========= 
========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4120
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4630 [30336 hazards]
...
...

I'll dig deeper when I can find a spare few hours. :)

@dacorvo

dacorvo commented Dec 11, 2024

@ahadnagy thank you for your feedback: ping me if you need any help.

@ahadnagy

ahadnagy commented Dec 11, 2024

So far I've found two possible race conditions:

The second one appears at relatively small sizes as well:

@pytest.mark.parametrize("tokens", [16])
@pytest.mark.parametrize("in_features", [512])
@pytest.mark.parametrize("out_features", [256])

@dacorvo Could these be related to the memory issues you suspected? I'm not sure whether they are false positives (I've never seen a false positive from this tool so far); I'll have to dig into the indexing and pipelining to make sense of it, as this kernel is quite involved.

But first, I'm going to send identity matrices through in a more minimal reproducer; hopefully a pattern will emerge.

@ahadnagy

Race condition confirmed. I tested the kernel with identity weights, and the output is randomly messed up.
Here's a reproducer:
https://github.com/ahadnagy/optimum-quanto/blob/4a6d678239d4a62b94ebce06bbfe0e6e7235abb0/test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py#L175-L207
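The gist of that reproducer, as a sketch (marlin_gemm is a hypothetical wrapper that quantizes/packs the weights and invokes the kernel):

import torch

def identity_mismatches(marlin_gemm, n=256, tokens=64):
    # With identity weights (losslessly representable on the int4 grid with a
    # suitable scale), the quantized gemm should return its inputs unchanged.
    x = torch.randn(tokens, n, dtype=torch.float16, device="cuda")
    w = torch.eye(n, dtype=torch.float16, device="cuda")
    y = marlin_gemm(x, w)           # hypothetical: packs w and calls the kernel
    return (y != x).sum().item()    # a non-zero count means corrupted elements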

I'm not sure how long it's gonna take to debug this. Maybe it's worth submitting an issue to the Marlin repo?

@dacorvo

dacorvo commented Dec 12, 2024

@ahadnagy thank you for investigating this. This is not the original Marlin kernel: it has been modified to integrate a shift in addition to the scale.
I isolated the modifications to the original kernel in this commit:
616980f
Maybe that will help identify whether the issue comes from the original kernel or from the modifications.
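For context, the modification only changes the dequantization applied inside the gemm. Schematically, and treating the exact convention as an assumption (it is defined in the linked commit):

import torch

def dequantize_scale_only(q, scale):
    # original Marlin kernel: per-group scale only
    return scale * q

def dequantize_scale_and_shift(q, scale, shift):
    # quanto variant: per-group scale plus a shift (zero-point contribution);
    # schematic only -- the exact convention is in the linked commit
    return scale * q - shift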

@ahadnagy

I'll take a look!

@ahadnagy

ahadnagy commented Dec 12, 2024

I did the same test on the original Marlin kernel and it exhibits the same behaviour:
https://github.com/ahadnagy/marlin/blob/5a4396ae41ed0acf3b4ac6169ca449eac7717fa1/test.py#L208

For k=32 the result is the same between kernel invocations; at k=64 there are ~60k differing values in the matrices.

Edit: anything above k=32 fails. Maybe dimension k can't spill over a single warp at reasonably large n.

@ahadnagy

FWIW, it fails quite predictably:

[attached plot showing the failure pattern]

@dacorvo

dacorvo commented Dec 13, 2024

@ahadnagy thank you for your investigations. I think that at this stage it may be worth creating an issue in the vLLM repository, where the Marlin kernels are now maintained, although I don't know how much time it would take to create a reproducible example using that version of the kernels: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/gptq_marlin/gptq_marlin.cu.
As far as quanto is concerned, do you have any ideas on how I could narrow down the investigation to identify what could prevent the race condition?

@ahadnagy

ahadnagy commented Dec 13, 2024

@dacorvo I'm happy to do it!

I'll make an attempt to reproduce this in the vLLM version today. Hopefully it won't get too involved.

As far as debugging the kernel goes, there's unfortunately no easy way to debug these kinds of issues.
I see two possible causes:

  1. If the race condition is caused by missing barriers, that's most likely a pipelining issue, so I'd start by dissecting the pipeline and the main loop. (CUDA_LAUNCH_BLOCKING=1 could help verify this, but it doesn't affect the streaming API, so in this case it might not help at all.) I tried sprinkling more synchronisation primitives at critical places, but that didn't help.

  2. It's also possible that the race condition is a symptom of an indexing issue, and that regions sometimes overlap in shared memory. It's going to be hard to spot the problem without completely understanding how the partitioning, tiling and slicing work in the kernel; it's not something that's easy to eyeball in this case. Normally, I go through the kernel by hand and place counters for global and shared memory accesses, then output stats.
     If the situation gets dire enough, it's also a viable option to log all accesses from the places flagged by Compute Sanitizer into a scratchpad tensor, save it as a dataframe, and use statistical methods to find the diverging threads and figure out which indices went out of bounds.

In some cases, Nsight Compute's uncoalesced memory access and bank-conflict metrics can also give a hint on top of what Compute Sanitizer finds.

@ahadnagy

"Good" news, I was able to reproduce this in the vLLM version as well:
https://github.com/ahadnagy/vllm/blob/d2d7def73a8e9c3843b32fdc9adc8e71605b397c/tests/kernels/test_marlin_gemm.py#L619-L686

It seems like only the GPTQ Marlin kernel is affected.

It fails for the same shapes and reports the same race conditions, so everything's in order to submit an issue. I'll do that tomorrow.

@lhjlhj11

> Yeah, same here. I was confused at first because the generated image was just pure noise, so I downgraded to this revision:
>
> https://github.com/huggingface/optimum-quanto.git@65ace79d6af6ccc27afbb3576541cc36b3e3a98b
>
> and it worked fine. (This was the 0.25.0.dev0.)

But downgrading to that revision leads to KeyError: 'time_text_embed.timestep_embedder.linear_1.base_layer.weight_qtype'

@dacorvo

dacorvo commented Jan 3, 2025

The race condition in the GPTQ Marlin kernel has been fixed: vllm-project/vllm#11493.
Applying similar changes to the quanto int4 Marlin kernels should fix this issue.

@ahadnagy

ahadnagy commented Jan 3, 2025

I tried the changes on this kernel version yesterday, and it appears to be working. I'll prepare a PR with the changes.

Edit: one thing that could be a showstopper is that the fix increases the required shared memory size, which raises the minimum compute capability to 8.0.
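If that constraint stands, a capability guard along these lines could keep older GPUs on the non-Marlin fallback (a sketch only, not how quanto currently selects kernels):

import torch

def can_use_marlin_int4() -> bool:
    # The fixed kernel needs more shared memory, which in practice means
    # compute capability >= 8.0 (Ampere and newer).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 0)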
