Corrupted outputs with Marlin int4 kernels as parallelization increases #332

Open
dacorvo opened this issue Oct 6, 2024 · 18 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@dacorvo

dacorvo commented Oct 6, 2024

When using MarlinInt4WeightQBitsTensor and its associated optimized gemm kernel, the readback of weights/scales/zero-points becomes unreliable as soon as parallelization increases.

As a consequence, output features beyond the first 128 are corrupted once a sufficient number of inputs are processed in parallel.

Test to reproduce the issue here:

@pytest.mark.xfail(reason="Bug in Marlin kernel", strict=False)
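Conceptually, the test compares the optimized kernel output against a dequantized reference and looks for corrupted output columns. A minimal sketch of that check, with marlin_linear as a hypothetical stand-in for the optimized call:

import torch

def corrupted_columns(marlin_linear, inputs, dequantized_weight, atol=1e-2):
    # marlin_linear: hypothetical callable wrapping the optimized int4 gemm
    out = marlin_linear(inputs)                   # optimized kernel output
    ref = inputs @ dequantized_weight.t()         # reference matmul on dequantized weights
    bad = (out - ref).abs().gt(atol).any(dim=0)   # flag each corrupted output feature
    # With enough parallel inputs, columns beyond the first 128 start to differ.
    return bad.nonzero().flatten()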

dacorvo added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Oct 6, 2024
@inarikami

inarikami commented Oct 12, 2024

This could be related, but I noticed that the latest release of optimum-quanto (v0.2.5) corrupts transformer weights during qfloat8 quantization. Downgrading to 0.2.4 solved the issue. I'm not sure what the exact cause is, but I will look into it.

Code that caused corruption in 0.2.5 but not earlier versions:

from diffusers import FluxPipeline
from optimum.quanto import quantize, freeze, qfloat8

pipe = FluxPipeline.from_pretrained(...)  # checkpoint and loading options elided in the original report

# Quantize the weights of each sub-model to qfloat8, then freeze them
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

quantize(pipe.text_encoder, weights=qfloat8)
freeze(pipe.text_encoder)

quantize(pipe.text_encoder_2, weights=qfloat8)
freeze(pipe.text_encoder_2)

@Leommm-byte

> Could be related, but I noticed the latest release of optimum-quanto (v0.2.5) corrupts transformer weights during qfloat8 quantization. Downgrading to 0.2.4 solved this issue. [same code snippet as above]

Yeah, same here. I was confused at first because the generated image was just pure noise, so I downgraded to this revision:

https://github.com/huggingface/optimum-quanto.git@65ace79d6af6ccc27afbb3576541cc36b3e3a98b

and it worked fine. (This was the 0.25.0.dev0.)

@dacorvo

dacorvo commented Oct 14, 2024

@inarikami @Leommm-byte this cannot be related, as the new Marlin kernel is only available for qint4 and is not used by default.
If you have a script that reproduces the corruption on 0.2.5, feel free to open a separate issue.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Nov 14, 2024
dacorvo removed the Stale label on Nov 14, 2024
@ahadnagy

ahadnagy commented Dec 11, 2024

I just quickly ran this through Compute Sanitizer, and it seems like there's a race condition in the Marlin kernel:

compute-sanitizer --tool=racecheck --target-processes=all pytest [REDACTED]//repos/optimum-quanto/test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py::test_marlin_int4_weight_qbits_tensor_linear_failing
========= COMPUTE-SANITIZER
=================================================================================================================================================== test session starts ===================================================================================================================================================
platform linux -- Python 3.12.7, pytest-7.4.4, pluggy-1.5.0
rootdir: [REDACTED]//repos/optimum-quanto
collected 16 items                                                                                                                                                                                                                                                                                                        

test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py ========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4eb0
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x51c0 [59520 hazards]
========= 
========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4e80
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x51b0 [58752 hazards]
========= 
========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4e40
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x51a0 [57984 hazards]
========= 
========= Error: Race reported between Read access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4120
=========     and Write access at void Marlin<(int)256, (int)3, (int)16, (int)4, (int)4, (int)8>(const int4 *, const int4 *, int4 *, const int4 *, const int4 *, int, int, int, int *)+0x4630 [30336 hazards]
...
...

I'll dig deeper when I can find a spare few hours. :)

@dacorvo

dacorvo commented Dec 11, 2024

@ahadnagy thank you for your feedback: ping me if you need any help.

@ahadnagy

ahadnagy commented Dec 11, 2024

So far I've found two possible race conditions:

The second one appears at relatively small sizes as well:

@pytest.mark.parametrize("tokens", [16])
@pytest.mark.parametrize("in_features", [512])
@pytest.mark.parametrize("out_features", [256])

@dacorvo Could these be related to the memory issues you suspected? I'm not sure whether they are false positives (I've never seen a false positive from this tool so far); I'll have to dig into the indexing and pipelining to make sense of it, as this kernel is quite involved.

But first, I'm going to send identity matrices through in a more minimal reproducer; hopefully a pattern will emerge.

@ahadnagy

Race condition confirmed. I tested the kernel with identity weights, and the output is randomly messed up.
Here's a reproducer:
https://github.com/ahadnagy/optimum-quanto/blob/4a6d678239d4a62b94ebce06bbfe0e6e7235abb0/test/tensor/weights/optimized/test_marlin_int4_weight_qbits_tensor.py#L175-L207
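The gist of that reproducer, as a sketch (marlin_gemm is a hypothetical wrapper that quantizes/packs the weights and invokes the kernel):

import torch

def identity_mismatches(marlin_gemm, n=256, tokens=64):
    # With identity weights (losslessly representable on the int4 grid with a
    # suitable scale), the quantized gemm should return its inputs unchanged.
    x = torch.randn(tokens, n, dtype=torch.float16, device="cuda")
    w = torch.eye(n, dtype=torch.float16, device="cuda")
    y = marlin_gemm(x, w)           # hypothetical: packs w and calls the kernel
    return (y != x).sum().item()    # a non-zero count means corrupted elements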

I'm not sure how long it's gonna take to debug this. Maybe it's worth submitting an issue to the Marlin repo?

@dacorvo

dacorvo commented Dec 12, 2024

@ahadnagy thank you for investigating this. This is not the original Marlin kernel: it has been modified to integrate a shift in addition to the scale.
I isolated the modifications to the original kernel in this commit:
616980f
Maybe that will help identify whether the issue comes from the original kernel or from the modifications.
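For context, the modification only changes the dequantization applied inside the gemm. Schematically, and treating the exact convention as an assumption (it is defined in the linked commit):

import torch

def dequantize_scale_only(q, scale):
    # original Marlin kernel: per-group scale only
    return scale * q

def dequantize_scale_and_shift(q, scale, shift):
    # quanto variant: per-group scale plus a shift (zero-point contribution);
    # schematic only -- the exact convention is in the linked commit
    return scale * q - shift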

@ahadnagy

I'll take a look!

@ahadnagy

ahadnagy commented Dec 12, 2024

I did the same test on the original Marlin kernel and it exhibits the same behaviour:
https://github.com/ahadnagy/marlin/blob/5a4396ae41ed0acf3b4ac6169ca449eac7717fa1/test.py#L208

For k=32 the result is the same between kernel invocations; at k=64 there are ~60k differing values in the matrices.

Edit: anything above k=32 fails. Maybe dimension k can't spill over a single warp at reasonably large n.

@ahadnagy

FWIW, it fails quite predictably:

[attached plot showing the failure pattern]

@dacorvo

dacorvo commented Dec 13, 2024

@ahadnagy thank you for your investigations. I think that at this stage it may be worth creating an issue in the vLLM repository, where the Marlin kernels are now maintained, although I don't know how much time it would take to create a reproducible example using that version of the kernels: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/gptq_marlin/gptq_marlin.cu.
As far as quanto is concerned, do you have any ideas on how I could narrow down the investigation to identify what could prevent the race condition?

@ahadnagy

ahadnagy commented Dec 13, 2024

@dacorvo I'm happy to do it!

I'll make an attempt to reproduce this in the vLLM version today. Hopefully it won't get too involved.

As far as debugging the kernel goes, there's unfortunately no easy way to debug these kinds of issues.
I see two possible causes:

  1. If the race condition is caused by missing barriers, that's most likely a pipelining issue, so I'd start by dissecting the pipeline and the main loop. (CUDA_LAUNCH_BLOCKING=1 could help verify this, but it doesn't affect the streaming API, so in this case it might not help at all.) I tried sprinkling more synchronisation primitives at critical places, but that didn't help.

  2. It's also possible that the race condition is a symptom of an indexing issue, and that regions sometimes overlap in shared memory. It's going to be hard to spot the problem without completely understanding how the partitioning, tiling and slicing work in the kernel; it's not something that's easy to eyeball in this case. Normally, I go through the kernel by hand and place counters for global and shared memory accesses, then output stats.
     If the situation gets dire enough, it's also a viable option to log all accesses from the places flagged by Compute Sanitizer into a scratchpad tensor, save it as a dataframe, and use statistical methods to find the diverging threads and figure out which indices went out of bounds.

In some cases, Nsight Compute's uncoalesced memory access and bank-conflict metrics can also give a hint on top of what Compute Sanitizer finds.

@ahadnagy

"Good" news, I was able to reproduce this in the vLLM version as well:
https://github.com/ahadnagy/vllm/blob/d2d7def73a8e9c3843b32fdc9adc8e71605b397c/tests/kernels/test_marlin_gemm.py#L619-L686

It seems like only the GPTQ Marlin kernel is affected.

It fails for the same shapes and reports the same race conditions, so everything's in order to submit an issue. I'll do that tomorrow.

@lhjlhj11

> Yeah, same here. I was confused at first because the generated image was just pure noise, so I downgraded to this revision:
>
> https://github.com/huggingface/optimum-quanto.git@65ace79d6af6ccc27afbb3576541cc36b3e3a98b
>
> and it worked fine. (This was the 0.25.0.dev0.)

But downgrading to that revision leads to KeyError: 'time_text_embed.timestep_embedder.linear_1.base_layer.weight_qtype'

@dacorvo

dacorvo commented Jan 3, 2025

The race condition in the GPTQ Marlin kernel has been fixed: vllm-project/vllm#11493.
Applying similar changes to the quanto int4 Marlin kernels should fix this issue.

@ahadnagy

ahadnagy commented Jan 3, 2025

I tried the changes on this kernel version yesterday, and it appears to be working. I'll prepare a PR with the changes.

Edit: one thing that could be a showstopper is that the fix increases the required shared memory size, which raises the minimum compute capability to 8.0.
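If that constraint stands, a capability guard along these lines could keep older GPUs on the non-Marlin fallback (a sketch only, not how quanto currently selects kernels):

import torch

def can_use_marlin_int4() -> bool:
    # The fixed kernel needs more shared memory, which in practice means
    # compute capability >= 8.0 (Ampere and newer).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 0)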
