Arrays larger than 4 GB crash #325
I did some further tests, and it seems like allocating more than 4GB returns garbage or randomly crashes. Example of allocating less than 4GB on an A770 16GB: the mean is around 0.5, which is expected.
Example of allocating more than 4GB on the CPU: the mean is around 0.5, which is expected.
Example of allocating more than 4GB on an A770 16GB: the mean is around 0.014, which is completely wrong.
In conclusion, allocating more than 4GB crashes or returns complete garbage.
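The code examples referenced above are collapsed in this thread; a minimal sketch of that style of check might look like the following (the tensor shapes are my own choice, assuming float32 tensors and the torch.xpu API that IPEX provides):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the "xpu" device

# Under 4 GB: 28000*28000 float32 is about 3.1 GB. The mean of uniform
# [0, 1) data should land near 0.5.
small = torch.rand(28000, 28000, device="xpu")
print("under 4GB:", small.mean().item())
del small
torch.xpu.empty_cache()

# Over 4 GB: 40000*40000 float32 is about 6.4 GB. On an affected driver this
# reportedly crashes or prints a mean far from 0.5.
large = torch.rand(40000, 40000, device="xpu")
print("over 4GB:", large.mean().item())
```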
@jingxu10
It should be allocated by Level Zero.
Would passing the module build options described at https://spec.oneapi.io/level-zero/latest/core/PROG.html#module-build-options help?
Hi, @BA8F0D39
Hi @BA8F0D39, thank you for using Intel products and IPEX. Is it possible to add the following flags and attach the log here when you hit the error?
Thank you.
On Windows 11 WSL:
Code
On Ubuntu 22.04 with kernel 6.3 it also crashes, but only after I close Python.
Code
Crash
I believe this issue is caused by an incorrect environment setup. You can follow this blog to set up the IPEX environment on WSL2 with Docker: https://medium.com/intel-analytics-software/stable-diffusion-with-intel-arc-gpus-f2986bba8365
@cchheennhhaaoo On Ubuntu 22.04 with kernel 6.3 it also crashes, but only after I close Python.
Code
Crash
I am able to replicate the same issue on Fedora 37 with kernel 6.2 and Ubuntu 22.04 with kernel 5.19. Both instances involve a build from the latest
It is odd that the crash error is only reported when you enable DEBUG flags; otherwise the code crashes silently.
Here are some quick findings: it's not exactly at 4GB, and I don't think the gibberish is related...
For FP16, I have some other weird bugs where sometimes it works and sometimes it doesn't, even for small arrays (less than 10000x10000). Even across multiple consecutive runs, it might work 50 times in a row, then go bonkers for 10. For FP32, the gibberish starts appearing at around 30800x30800, which is 3.79456GB. Before that, starting around 30400x30400, it alternates between gibberish and good output over successive runs. With such numerical instability, I might write a script and test every possible combination at this point; it might also be worth taking a look at other random sampling methods.
Just did another quick run for FP32 at 30800x30800, and this time it works just fine (even 32000x32000 works this time around); there is some weird instability going on... Quick thought: since I am not using a fixed seed in these tests, might it be that some "bad seeds" are causing the instability?
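A rough sketch of the sweep-script idea mentioned above, using a fixed seed so repeated runs are comparable (the size range, step, and tolerance are arbitrary choices of mine, not values from this thread):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

torch.xpu.manual_seed(0)  # fixed seed so every run draws the same values

# Step the square FP32 tensor size across the ~4 GB boundary and flag
# any mean that strays noticeably from the expected 0.5.
for n in range(28000, 34001, 400):
    x = torch.rand(n, n, device="xpu")
    mean = x.mean().item()
    size_gib = n * n * 4 / 1024**3
    flag = "" if abs(mean - 0.5) < 0.01 else "  <-- suspicious"
    print(f"{n}x{n} ({size_gib:.2f} GiB): mean={mean:.4f}{flag}")
    del x
    torch.xpu.empty_cache()
```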
@fredlarochelle If the adjacent memory locations just so happen to contain zeros, then the mean is around 0. If the adjacent memory locations just so happen to contain uniformly distributed values from 0 to 1, then the mean is 0.5. It could allow you to read another program's data on the GPU.
@BA8F0D39 That would make sense, but I still get the instability for FP16, and FP32 starts acting weird before it would actually overfill a 32-bit buffer, so there is probably more than one problem going on at the same time.
@fredlarochelle @BA8F0D39 Thanks for the feedback. The issue mentioned here (the so-called numerical instability) looks like one we met recently in internal testing. It might be caused by cache consistency after a global memory fence. We are following up. BTW, as for crashes when allocating memory larger than 4GB, we cannot reproduce them on the recommended driver.
@arthuryuan1987 On Ubuntu Linux 22.04 with the 5.19 out-of-tree driver (intel-i915-dkms intel-platform-vsec-dkms intel-platform-cse-dkms intel-fw-gpu), it randomly crashes and is not deterministic. On Ubuntu Linux 22.04 with the 6.3 mainline kernel, it also randomly crashes. I can force it to crash 100% of the time by enabling debug flags.
@arthuryuan1987 I am on Ubuntu 22.04.2 with 5.19.0-41-generic, on the latest driver, all following the installation instructions in the documentation, with a build from the latest commit in the
I used a Vulkan GPU memory tester. It seems all memory regions above 4GB are corrupt, and the read transfer speed is 1.9 GB/s.
@BA8F0D39 I checked the repo, https://github.com/GpuZelenograd/memtest_vulkan
Could you please provide an update on the status of this issue? On the latest
I second this issue. As a result, I am switching hardware and seeing how I fare with ROCm, then, as a final fallback, CUDA.
I got it to work by making a very crude patch to intel-compute-runtime, and I was able to run the code below, but the result was around 0.24, I think.
@mahiro21h:
@RogerWeihrauch Yes, I tested the same code before and after applying the patch. It's using my Arc A770.
Right, which is the same result the issue opener ran into: the memory allocation is invalid past the 4GB mark and contains garbage, hence you don't get 0.5 or something close to it, which is what you should get.
No, this is a universal limitation. Intel needs to do something to solve it for all of us.
@mahiro21h:
@ALL
@RogerWeihrauch We have an Intel employee, @CaoZhongZ, who is revisiting this issue now. Per the current evaluation, >4GB allocation from the framework depends on >4GB support for the GPU matmul feature in oneDNN. We will update here once we have more solid results.
I've been digging through the whole stack for points that prevent the allocation. So, after removing the allocation blocker inside IPEX, please try this environment variable:
See if it works.
I am guessing this needs a modified IPEX without the usual check? I get the usual runtime error if I try it on the released vanilla version of IPEX 2.5.10+xpu.
I will rebuild a custom version again without the check on the
@RogerWeihrauch Hi, I was going to wait to post the steps until I figured out what's wrong with the allocation, but I instead ended up wasting a week trying to get IPEX to build again... You're most likely going to simply get garbage output when using it in Comfy, but I'll post the patches anyway. I don't know the best way to share them with everyone, so I'll put them on Pastebin.
I made a mistake in my original patch for intel-compute-runtime and didn't include a certain change when I tested IPEX and posted my results. The patch I'm sending contains the change, so this may improve the results or have no effect at all. If you don't want the change, just remove the change made to patches:
@simonlui In this case, you could just try the PyTorch XPU release without IPEX to test whether the system can allocate >4GB memory. The upstream allocator has now had the limitation check removed. We'll remove the limitation on ARC770 in the next release. You could wait if you are not in a hurry.
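For reference, the same check on stock PyTorch with the built-in XPU backend and no IPEX import would look roughly like this (assuming a PyTorch build, e.g. a recent nightly, where torch.xpu is available):

```python
import torch

assert torch.xpu.is_available(), "no XPU device visible to PyTorch"

# ~6.4 GB of float32; the mean should be close to 0.5 if the allocation is sane.
x = torch.rand(40000, 40000, device="xpu")
print(x.mean().item())
```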
@CaoZhongZ It seems to work without the check, although I haven't tested whether it actually spits out the right things with the code with the
I will test the nightly PyTorch package later and run some more tests to make sure this actually works as intended.
@CaoZhongZ I left it running overnight on a large Stable Diffusion task with a big resolution to go >4GB, let it AOT compile overnight, and when I looked at it in the morning it had failed. I then ran the code from the issue opener and I see the same issue with the allocation, where XPU gives a value around half of what it should be, at 0.25, when it should be near 0.5. I did a comparison with the CPU as well.
You see the same result with the nightly PyTorch XPU package.
It seems the issue is still not resolved, and we're back to how IPEX operated before the allocation limits were put in place.
Thanks for the triage, working on it.
@ALL: Hi
No, not yet, sadly.
@simonlui How big was the image you tried to generate with Stable Diffusion? I was able to generate images as big as 1792x1792 in Automatic1111. It took about 6 minutes per image.
For this instance, for the purposes of testing after the AOT compile worked? 1536x1536 with no VRAM-saving techniques, for quality, and
I think we both ran into the same issue. I tried to generate a 1024x1024 image using the latest release of Comfy without any VRAM optimizations or other flags, and I also got a UR error when I clicked to generate another image after having already received an out-of-memory error. If I instead try to generate a 768x768 image, I notice my VRAM usage shoots up to 13.6GB, so I guess there really wasn't enough VRAM left? Also, when you tried to generate that image, it complained that it tried to allocate an absurd amount of memory but couldn't, didn't it? It tried to allocate 40GB in my case, so I just switched to Automatic1111.
Use this repo instead of the original if you want to enable 4GB support for ipex-2.5 (enables bindless kernels). Modify here: to use this: Also change here: to
Use
to enable AOT for arc770.
@CaoZhongZ This does work, at least for the limited test the issue opener brought up. However, I had some major trouble trying to compile with your branch of
As the two mean values with
Some questions, though, about this fix. 1.) To use this, will you need to use
Thank you for your patience in working with me.
The "UR" in environment variable means "unified runtime" which is an integrated part of compiler. So, if we want to ditch the environment variable we have to update compiler, and this is something people have to wait a little bit longer. We don't quite know the empirical performance impact of bindless on ARC. It affects all those kernels that didn't work on >4GB buffers but not those already functioning. And for diffusers and transformers the large portion of performance generated by systolic so I would expected not so much end-to-end change. The decision to fix the issue depends on maturity of toolchain and driver. Given we have released BMG which in nature is bindless for all occasions, we believe the time has come. The release schedule is what we will discuss so I'll bring up the nightly release thing. |
@CaoZhongZ So, good news and bad news. I managed to get this working on top of IPEX v2.5.10+xpu; I had to make a custom patch and put it into the
I can open a separate issue for this if needed, but yeah, it seems like there are still issues, possibly in the UR runtime, that might make using >4GB VRAM allocations still impractical even if it technically works.
I had success with this:
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
although I'm sure it doesn't correct the core issue.
Describe the bug
Intel compute runtime doesn't allow allocating a buffer bigger than 4 GB.
intel/compute-runtime#627
When you allocate an array bigger than 4 GB with intel-extension-for-pytorch on an A770 16GB, it crashes.
Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
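One reading of the question above is to chunk the array at the framework-user level so that no single buffer exceeds 4 GB; a hedged sketch of that workaround (my own illustration, not how the IPEX allocator actually behaves) could be:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

# Equivalent of one 40000x40000 float32 array (~6.4 GB), held as four row
# chunks of ~1.6 GB each so no single buffer crosses the 4 GB limit.
rows, cols, n_chunks = 40000, 40000, 4
chunks = [torch.rand(rows // n_chunks, cols, device="xpu") for _ in range(n_chunks)]

# Reductions can be combined across chunks without ever materializing the
# full array in a single allocation.
total = sum(chunk.sum() for chunk in chunks)
print((total / (rows * cols)).item())  # expected to be near 0.5
```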
Versions