
Arrays larger than 4 GB crash #325

Open
BA8F0D39 opened this issue Apr 8, 2023 · 92 comments
Labels: ARC (ARC GPU), Crash (Execution crashes)

Comments

@BA8F0D39

BA8F0D39 commented Apr 8, 2023

Describe the bug

The Intel compute runtime doesn't allow allocating a single buffer larger than 4 GB.

intel/compute-runtime#627

When you allocate an array larger than 4 GB in intel-extension-for-pytorch on an A770 16GB (e.g. the 46000 x 46000 float32 tensor below, about 8.5 GB), it crashes.

x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')

Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
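
As a stopgap on the user side (not a fix for the runtime limitation itself), the same computation can be done while keeping each individual allocation under 4 GB by working in row chunks. A minimal sketch, assuming the goal is only the mean of uniform random data rather than one contiguous tensor; the chunk size of 8000 rows is an arbitrary choice:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' device)

torch.manual_seed(0)

rows, cols = 46000, 46000
chunk_rows = 8000  # 8000 x 46000 float32 is about 1.5 GB per chunk, safely below 4 GB

total = 0.0
for start in range(0, rows, chunk_rows):
    n = min(chunk_rows, rows - start)
    chunk = torch.rand(n, cols, dtype=torch.float32, device='xpu')
    total += chunk.sum().item()  # reduce on the device, accumulate on the host in double precision
    del chunk                    # drop the buffer before allocating the next one
    torch.xpu.empty_cache()      # release cached blocks back to the driver

print("Mean:", total / (rows * cols))  # expected to be close to 0.5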

Versions

Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.3.0-1-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration: 
[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
Intel OpenCL ICD version: 22.43.24595.35+i538~22.04
Level Zero version: 1.3.24595.35+i538~22.04

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          20
On-line CPU(s) list:             0-19
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  Intel(R) Corporation
Model name:                      13th Gen Intel(R) Core(TM) i5-13600K
BIOS Model name:                 13th Gen Intel(R) Core(TM) i5-13600K
CPU family:                      6
Model:                           183
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       1
Stepping:                        1
CPU max MHz:                     5100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6991.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       544 KiB (14 instances)
L1i cache:                       704 KiB (14 instances)
L2 cache:                        20 MiB (8 instances)
L3 cache:                        24 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-19
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] numpy==1.24.1
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] N/A

@BA8F0D39 BA8F0D39 closed this as completed Apr 8, 2023
@BA8F0D39 BA8F0D39 reopened this Apr 14, 2023
@jingxu10 jingxu10 added the ARC (ARC GPU) and Crash (Execution crashes) labels Apr 16, 2023
@jingxu10
Contributor

@tye1

@BA8F0D39
Author

I did some further tests and it seems like allocating more than 4GB returns garbage or randomly crashes.

Example of allocating less than 4 GB on an A770 16GB. The mean is around 0.5, which is expected.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(30000, 30000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())


python3 ./test.py 
 Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.50001085

Example of allocating more than 4GB on CPU. The mean is around 0.5 which is expected.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='cpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())



python3 ./test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.4999941

Example of allocating more than 4 GB on an A770 16GB. The mean is around 0.014, which is completely wrong.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())


python3 ./test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.014004011

In conclusion, allocating more than 4GB crashes or returns complete garbage.

@BA8F0D39
Author

@jingxu10
Is memory allocation done by OpenCL, Level Zero, or OneDNN?

@jingxu10
Contributor

It should be allocated by Level Zero.
@gujinghui

@BA8F0D39
Author

@jingxu10

Will passing -ze-opt-greater-than-4GB-buffer-required into the build options fix it?

https://spec.oneapi.io/level-zero/latest/core/PROG.html#module-build-options

@cchheennhhaaoo

cchheennhhaaoo commented Apr 27, 2023

Hi, @BA8F0D39
What's the driver version? I cannot reproduce the random crash with agama-ci-devel-602. From what I've tried, the maximum workable input shape for your unit test is about 59500*59500, which corresponds to a memory size of about 13.2 GiB. That is a reasonable result.
As for the accuracy issue, we will check it.
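
(For reference, a rough sketch of the arithmetic behind that figure, assuming float32 at 4 bytes per element:)

side = 59500
bytes_needed = side * side * 4   # float32 elements
print(bytes_needed / 2**30)      # about 13.19 GiB, i.e. the quoted ~13.2G on a 16 GB card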

@zejun-chen
Contributor

Hi @BA8F0D39

Thank you for using Intel products and IPEX.
On our side we can successfully allocate large memory (up to the total physical memory size) and it computes correctly.
Can you provide the driver version you are using with the command below?
sudo dpkg -l | grep intel

And is it possible to add the following flags and attach the log here when you find the error?

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

Thank you.

@BA8F0D39
Author

BA8F0D39 commented Apr 27, 2023

@cchheennhhaaoo

On Windows 11 under WSL:

ii  intel-level-zero-gpu                  1.3.24595.35+i538~22.04                 amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-oneapi-runtime-ccl              2021.8.0-25371                          amd64        Intel® oneAPI Collective Communications Library runtime
ii  intel-oneapi-runtime-compilers        2023.0.0-25370                          amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-compilers-common 2023.0.0-25370                          all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-dpcpp-cpp        2023.0.0-25370                          amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-dpcpp-cpp-common 2023.0.0-25370                          all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-mkl              2023.0.0-25398                          amd64        Intel® oneAPI Math Kernel Library runtime
ii  intel-oneapi-runtime-mkl-common       2023.0.0-25398                          all          Intel® oneAPI Math Kernel Library runtime common
ii  intel-oneapi-runtime-mpi              2021.8.0-25329                          amd64        Intel® MPI Library runtime
ii  intel-oneapi-runtime-opencl           2023.0.0-25370                          amd64        Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii  intel-oneapi-runtime-openmp           2023.0.0-25370                          amd64        Intel® OpenMP* Runtime Library runtime
ii  intel-oneapi-runtime-openmp-common    2023.0.0-25370                          all          l_openmp.runtime.description>
ii  intel-oneapi-runtime-tbb              2021.8.0-25334                          amd64        Intel® oneAPI Threading Building Blocks runtime
ii  intel-oneapi-runtime-tbb-common       2021.8.0-25334                          all          Intel® oneAPI Threading Building Blocks runtime common
ii  intel-opencl-icd                      22.43.24595.35+i538~22.04               amd64        Intel graphics compute runtime for OpenCL

Code

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1
          zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
                zeModuleCreate = 1     \--->               zeModuleDestroy = 1
                zeKernelCreate = 1     \--->               zeKernelDestroy = 1
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
  zeCommandListCreateImmediate = 1     |
           zeCommandListCreate = 2     \--->          zeCommandListDestroy = 3
                 zeEventCreate = 8     \--->                zeEventDestroy = 8
                 zeFenceCreate = 2     \--->                zeFenceDestroy = 2
                 zeImageCreate = 0     \--->                zeImageDestroy = 0
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
              zeMemAllocDevice = 1     |
                zeMemAllocHost = 0     |
              zeMemAllocShared = 0     \--->                     zeMemFree = 0     ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Aborted

crashlog.txt

@BA8F0D39
Author

BA8F0D39 commented Apr 27, 2023

@cchheennhhaaoo
@zejun-chen

On Ubuntu 22.04 with Linux 6.3, it also crashes, but only after I close Python.

ii  intel-level-zero-gpu                  1.3.25593.18-601~22.04                   amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-oneapi-runtime-ccl              2021.9.0-43543                           amd64        Intel® oneAPI Collective Communications Library runtime
ii  intel-oneapi-runtime-compilers        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-compilers-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-dpcpp-cpp        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-dpcpp-cpp-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-mkl              2023.1.0-46342                           amd64        Intel® oneAPI Math Kernel Library runtime
ii  intel-oneapi-runtime-mkl-common       2023.1.0-46342                           all          Intel® oneAPI Math Kernel Library runtime common
ii  intel-oneapi-runtime-mpi              2021.9.0-43482                           amd64        Intel® MPI Library runtime
ii  intel-oneapi-runtime-opencl           2023.1.0-46305                           amd64        Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii  intel-oneapi-runtime-openmp           2023.1.0-46305                           amd64        Intel® OpenMP* Runtime Library runtime
ii  intel-oneapi-runtime-openmp-common    2023.1.0-46305                           all          l_openmp.runtime.description>
ii  intel-oneapi-runtime-tbb              2021.9.0-43484                           amd64        Intel® oneAPI Threading Building Blocks runtime
ii  intel-oneapi-runtime-tbb-common       2021.9.0-43484                           all          Intel® oneAPI Threading Building Blocks runtime common
ii  intel-opencl-icd                      23.05.25593.18-601~22.04                 amd64        Intel graphics compute runtime for OpenCL
ii  libdrm-intel1:amd64                   2.4.115+git2303241447.28d9a3c4~j~mesarc0 amd64        Userspace interface to intel-specific kernel DRM services -- runtime

Code

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())

Crash

ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1    
          zeCommandQueueCreate = 2     \--->         zeCommandQueueDestroy = 2    
                zeModuleCreate = 2     \--->               zeModuleDestroy = 2    
                zeKernelCreate = 3     \--->               zeKernelDestroy = 3    
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1    
  zeCommandListCreateImmediate = 1     | 
           zeCommandListCreate = 5     \--->          zeCommandListDestroy = 6    
                 zeEventCreate = 18    \--->                zeEventDestroy = 18   
                 zeFenceCreate = 5     \--->                zeFenceDestroy = 5    
                 zeImageCreate = 0     \--->                zeImageDestroy = 0    
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0    
              zeMemAllocDevice = 2     | 
                zeMemAllocHost = 0     | 
              zeMemAllocShared = 0     \--->                     zeMemFree = 1     ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
 
 
Aborted (core dumped)

crash2.txt

@cchheennhhaaoo

I believe this issue is caused by an incorrect environment setup. You can follow this blog to set up the IPEX environment on WSL2 with Docker: https://medium.com/intel-analytics-software/stable-diffusion-with-intel-arc-gpus-f2986bba8365

@BA8F0D39
Author

BA8F0D39 commented Apr 27, 2023

@cchheennhhaaoo
@zejun-chen
I have the same problem on Ubuntu Linux too (not on Windows).

On Ubuntu 22.04 with Linux 6.3, it also crashes, but only after I close Python.

(Same package list, test code, and crash output as in my previous comment above; see crash2.txt.)

@fredlarochelle

I am able to replicate the same issue on Fedora 37 with 6.2 and Ubuntu 22.04 with 5.19. Both instances involve a build from the latest xpu-master branch.

@BA8F0D39
Author

It is weird that the crash error is only reported when you enable the DEBUG flags; otherwise the code fails silently.

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

@fredlarochelle

Here are some quick findings I had: it's not exactly at 4 GB, so I don't think the gibberish is related to that boundary alone...

# All good
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(40000, 40000, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #3.2GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.287109375
# All good
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(46000, 46000, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.232GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.2890625
# At 46001x46001 it goes gibberish
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(46001, 46001, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.423218400GB
print("Mean:", torch.mean(array).item()) #0.00372314453125
print("Standard Deviation:", torch.std(array).item()) #0.049072265625

For FP16, I have some other weird bugs where sometimes it works and sometimes it doesn't, even for small arrays (less than 10000x10000). Even across multiple consecutive runs, it might work 50 times in a row, then go bonkers for 10.

For FP32, the gibberish starts appearing at around 30800x30800, which is 3.79456 GB. Before that, starting around 30400x30400, the output alternates between gibberish and good results across multiple successive runs.

With such numerical instability, I might write a script and test every possible combination at this point (a sketch follows below); it might also be worth taking a look at other random sampling methods.
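
A minimal sketch of such a sweep, assuming an IPEX build with a working 'xpu' device; the shape range, step, and 0.01 tolerance are arbitrary choices:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

# Sweep side lengths around the suspected threshold and flag shapes whose
# sample mean drifts far from the 0.5 expected for uniform [0, 1) data.
for side in range(30000, 48001, 1000):
    x = torch.rand(side, side, dtype=torch.float32, device='xpu')
    size_gb = x.element_size() * x.nelement() / 1e9
    mean = torch.mean(x).item()
    status = "OK" if abs(mean - 0.5) < 0.01 else "SUSPECT"
    print(f"{side}x{side}  {size_gb:6.2f} GB  mean={mean:.5f}  {status}")
    del x
    torch.xpu.empty_cache()  # free the buffer before the next, larger allocation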

@fredlarochelle

Just did another quick run for FP32 at 30800x30800 and this time it works just fine (even 32000x32000 works this time around); there is some weird instability going on...

Quick thought: since I am not using a fixed seed in those tests, might it be that some "bad seeds" are causing the instability?

@BA8F0D39
Author

BA8F0D39 commented May 25, 2023

@fredlarochelle
I think some pointers in the oneDNN GPU kernels use 32-bit unsigned integers and some use 64-bit unsigned integers. Reading more than 4 GB creates a buffer over-read (reading adjacent memory locations and other arrays).

If the adjacent memory locations just so happen to contain zeros, then the mean is around 0.

If the adjacent memory locations just so happen to contain uniformly distributed values from 0 to 1, then the mean is 0.5.

It could allow you to read another program's data on the GPU.
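
To put rough numbers on that hypothesis (this is only the arithmetic of a hypothetical 32-bit offset truncation, not the actual oneDNN code):

# A 47000 x 47000 float32 tensor spans far more than 4 GB:
n_elems = 47000 * 47000
n_bytes = n_elems * 4
print(n_bytes / 1e9)        # about 8.84 GB in total

# If a kernel truncated byte offsets to 32 bits, only the first 4 GiB worth of
# elements would be addressed correctly:
first_bad = 2**32 // 4
print(first_bad / n_elems)  # about 0.49 -> roughly half the tensor reads the wrong memory

# A truncated offset wraps back toward the start of the address space:
last_offset = (n_elems - 1) * 4
print(last_offset % 2**32)  # the address a 32-bit kernel would actually read for the last element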

@fredlarochelle

@BA8F0D39 That would make sense, but FP16 and FP32 still start acting weird before they would actually overflow a 32-bit buffer, plus there is the instability; there is probably more than one problem going on at the same time.

@fengyuan14

@fredlarochelle @BA8F0D39 Thanks for the feedback.

The issue mentioned here (the so-called numerical instability) looks like one we met recently in internal testing. It might be caused by cache consistency after a global memory fence. We are following up.

BTW, as for the crashes when allocating memory larger than 4 GB, we cannot reproduce them on the recommended driver.

@BA8F0D39
Author

@arthuryuan1987
On Windows 11 with WSL, it crashes 100% of the time.

On Ubuntu Linux 22.04 with the 5.19 out-of-tree driver (intel-i915-dkms intel-platform-vsec-dkms intel-platform-cse-dkms intel-fw-gpu), it randomly crashes and is not deterministic.
https://dgpu-docs.intel.com/driver/client/overview.html

On Ubuntu Linux 22.04 with 6.3 mainline kernel, it also randomly crashes.

I can force it to crash 100% of the time by enabling the debug flags.

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

@fredlarochelle

@arthuryuan1987 I am on Ubuntu 22.04.2 with 5.19.0.41-generic, on the latest driver, all following the installation instructions in the documentation, with a build from the latest commit of the xpu-master branch.

@BA8F0D39
Author

BA8F0D39 commented Jun 9, 2023

@arthuryuan1987

I used a Vulkan GPU memory tester.
https://github.com/GpuZelenograd/memtest_vulkan

It seems all memory regions above 4GB are corrupt and the read transfer speed is 1.9 GB/s.

./memtest_vulkan 1 9140000000 
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x190000000..=0x20FFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6672 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Standard 5-minute test of 1: Bus=0x03:00 DevId=0x56A0   16GB Intel(R) Arc(tm) A770 Graphics (DG2)
      1 iteration. Passed  5.6310 seconds  written:    5.5GB 956.2GB/sec        checked:    8.2GB   1.5GB/sec
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF  iteration:2
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 760|6653 42k 204k789k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

@fengyuan14

@BA8F0D39 I checked the repo, https://github.com/GpuZelenograd/memtest_vulkan
It appears to be an OpenCL-based application (tool). As far as I know, A64 stateless addressing has a big performance penalty on ARC; I guess the OpenCL driver disables >4GB allocations for that reason. Regarding the IPEX stack, not all underlying components guarantee A64 stateless addressing, so after the next code synchronization IPEX will raise an explicit error to users as well.

@fredlarochelle

Could you please provide an update on the status of this issue? On the latest xpu_master branch, I have observed that it is currently exhibiting intermittent behavior. At times, when allocating a batch size larger than 4 GB, it crashes with the -5 error, while at other times it functions correctly without any issues. Or might the -5 error I am getting be related to another issue? Interestingly, from my observations, the error does not seem to occur when the batch size remains under 4 GB.

@vampireLibrarianMonk

I second this issue. As a result, I am switching hardware and seeing how I fare with ROCm, then as a final fallback, CUDA.

@mahiro21h

mahiro21h commented Jan 3, 2025

i got it to work by making a very stupid patch to intel-compute-runtime, and i was able to run the code below but the result was around 0.24 i think.

import torch
import intel_extension_for_pytorch

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print(torch.mean(x).item())

@RogerWeihrauch

RogerWeihrauch commented Jan 3, 2025

@mahiro21h :
Hi there. I hope I did not get you wrong, but regarding your 'fix/workaround':
since it uses device='xpu', are you really sure all the 'rendering' actually happens on your Intel GPU and not on your CPU?
@simonlui:
Thank you very much for your hint, but:
since you are running in a Docker environment, are you sure this will also solve 'my'/Intel's 4 GB limit problem with my dedicated on-prem ARC GPU?
And if so (I am sorry, since I am not a really good/hardcore programmer):
could you guide me through the respective steps, or link me some guides on how to perform them?
Thank you all for your effort to get this solved.
Regards,
Roger

@mahiro21h

@RogerWeihrauch yes, i tested the same code before and after applying the patch. it's using my arc a770

@simonlui

simonlui commented Jan 4, 2025

@mahiro21h

i got it to work by making a very stupid patch to intel-compute-runtime, and i was able to run the code below but the result was around 0.24 i think.

Right, which is the same result that the issue opener ran into: the memory allocation is invalid past the 4 GB mark and contains garbage, hence why you don't get 0.5 or something close to it, which is what you should get.

@RogerWeihrauch

since u are running on/in Docker env; are u sure this will also solve 'my'/Intel's problem on the 4GB limit w/ my dedicated on-prem ARC GPU?

No, this is a universal limitation. Intel needs to do something to solve it for all of us.

@RogerWeihrauch

@mahiro21h :
Ok, thanks for your response. Since, as already mentioned, I am not a dev/programmer, could you please give me a rough overview of, or roughly guide me through, the steps to be taken to get this corrected?
To be honest, I have nearly lost the overview of 'my' issue by now, i.e. what has to be done, and how, to get my A770 running in ComfyUI successfully.
This would be highly appreciated.
Thanks in advance,
Roger

@RogerWeihrauch

@ALL
So, is there any Intel employee/member responsible for taking care of this limitation so it gets solved?
It seems everyone involved here has been beamed away, and I am still suffering from this issue.
If I could, I would try to solve it on my own, but I am not capable of this.
So, anyone there?
Regards,
Roger

@tye1

tye1 commented Jan 14, 2025

@RogerWeihrauch We have an Intel employee, @CaoZhongZ, who is revisiting this issue now. Per the current evaluation, >4GB allocation from the framework depends on >4GB support for the GPU matmul feature in oneDNN. We will update here once we have more solid results.

@CaoZhongZ

I've been digging through the whole stack for the points that prevent the allocation. So, after removing the allocation blocker inside IPEX, please try this environment variable:

export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1

See if it works.
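
For reference, a quick check one could run after exporting that variable (a sketch; it assumes a build with the IPEX-side allocation blocker removed as described above, and the file name is arbitrary):

# run as: UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1 python check_4gb.py
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

torch.manual_seed(0)
x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')  # about 8.8 GB, well past 4 GB
print(torch.mean(x).item())  # expect roughly 0.5 if the >4GB allocation and kernels behave correctly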

@simonlui

I am guessing this needs a modified IPEX without the usual check? I get the usual runtime error when I try it on the released vanilla version of IPEX 2.5.10+xpu.

❯ export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
❯ python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
[W115 11:01:30.383135487 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
2.5.1+cxx11.abi
2.5.10+xpu
[0]: _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.3.30049.600000', total_memory=15473MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=32, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
❯ python
Python 3.12.5 | Intel Corporation | (main, Sep  9 2024, 23:35:37) [GCC 14.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import intel_extension_for_pytorch
[W115 11:01:41.343038184 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
>>> a = torch.zeros([1024, 1024, 1024, 2], device='xpu')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 8.00 GiB (GPU  0; 15.11 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)

I will rebuild a custom version again without the check on the xpu-main branch and report back.

@mahiro21h

@RogerWeihrauch hi, i was going to wait before posting the steps until i figured out what's wrong with the allocation, but i instead ended up wasting a week trying to get ipex to build again... you're most likely going to simply get garbage output when using it in comfy but i'll post the patches anyway. i don't know what's the best way to share them with everyone so i'll put them on pastebin

@mahiro21h

i made a mistake in my original patch for intel compute runtime and didn't include a certain change when i tested ipex and posted my results. the patch im sending contains the change, so this may improve the results or have no effect at all. if you don't want the change just remove the change made to ze_module_api_entrypoints.h, however, i'd like someone to test the patch as-is and see if anything changed.

patches:
intel compute runtime patch
ipex patch

@CaoZhongZ

@simonlui In this case, you could just try the PyTorch XPU release without IPEX to test whether the system can allocate >4GB memory. The upstream allocator has now removed the limitation check. We'll remove the limitation for the ARC A770 in the next release; you could wait if you are not in a hurry.

@simonlui

@CaoZhongZ It seems to work without the check, although I haven't tested whether it actually produces the right results with the torch.rand mean code or something like Stable Diffusion to make sure the allocation is fine, since I forgot to specify AOT, so it takes too long to compile the code for that.

❯ export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
❯ python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
[W116 01:08:52.043395890 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /root/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /root/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /root/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2901 (function operator())
[W116 01:09:11.971472994 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /root/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /root/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /root/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2901 (function operator())
2.5.0a0+gita8d6afb
2.5.10+git23d1598
[0]: _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.3.30049.600000', total_memory=15473MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=32, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
❯ python
Python 3.12.5 | Intel Corporation | (main, Sep  9 2024, 23:35:37) [GCC 14.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
[W116 01:09:11.971472994 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /root/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /root/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /root/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2901 (function operator())
>>> import intel_extension_for_pytorch
[W116 01:09:11.971472994 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /root/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /root/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /root/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2901 (function operator())
>>> a = torch.zeros([1024, 1024, 1024, 2], device='xpu')
>>> a.size()
torch.Size([1024, 1024, 1024, 2])

I will test the nightly Pytorch package later and run some more tests to make sure this actually works as intended.

@simonlui

@CaoZhongZ I left it running overnight on a large Stable Diffusion task with a big resolution to go >4GB, letting it AOT compile, and when I looked at it in the morning it had failed. I then ran the code from the issue opener and see the same allocation issue: XPU gives a mean of about 0.25, around half of what it should be, while it should be near 0.5. I did a comparison with CPU as well.

❯ export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
❯ python
Python 3.12.5 | Intel Corporation | (main, Sep  9 2024, 23:35:37) [GCC 14.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
[W116 09:24:08.182953932 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /root/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /root/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /root/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2901 (function operator())
[W116 09:24:11.960657210 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
    registered at /root/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /root/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
       new kernel: registered at /root/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2901 (function operator())
>>> import intel_extension_for_pytorch
>>> torch.manual_seed(0)
<torch._C.Generator object at 0x7f5361b51230>
>>> x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')
>>> y = torch.rand(47000, 47000, dtype=torch.float32, device='cpu')
>>> print(torch.mean(x).item())
0.24303479492664337
>>> print(torch.mean(y).item())
0.5000026822090149

You see the same result with the nightly Pytorch xpu package.

❯ export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
❯ python -c "import torch; print(torch.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
2.7.0.dev20250110+xpu
[0]: _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.3.30049.600000', total_memory=15473MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=32, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
❯ python
Python 3.12.5 | Intel Corporation | (main, Sep  9 2024, 23:35:37) [GCC 14.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.manual_seed(0)
<torch._C.Generator object at 0x7f8d9bf690f0>
>>> x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')
>>> y = torch.rand(47000, 47000, dtype=torch.float32, device='cpu')
>>> print(torch.mean(x).item())
0.24304091930389404
>>> print(torch.mean(y).item())
0.49999409914016724

Seems like the issue is still not resolved and we're back to how IPEX operated before the allocation limits were placed.

@CaoZhongZ

Thanks for the triage, working on it.

@RogerWeihrauch

@ALL: Hi,
I want to thank you all for your goodwill and effort on this.
@mahiro21h
Thanks for your hint.
Question to all:
Could somebody please give a rule of thumb on whether/how this will solve my problem when using an IPEX-like option in ComfyUI (as e.g. in SD.Next/Automatic)?
Put another way: will I be able to use more than 4 GB of my Arc A770's VRAM for somewhat bigger models in ComfyUI?
Thanks and regards,
Roger

@mahiro21h

mahiro21h commented Jan 18, 2025

@RogerWeihrauch

Will I be able to use more than 4 GB of my Arc A770's VRAM for somewhat bigger models in ComfyUI?

no, not yet sadly.

@mahiro21h

@simonlui how big was the image you tried to generate with stable diffusion? i was able to generate images as big as 1792x1792 in automatic1111. it took about 6 minutes per image

@simonlui

simonlui commented Jan 18, 2025

how big was the image you tried to generate with stable diffusion? i was able to generate images as big as 1792x1792 in automatic1111. it took about 6 minutes per image

For this instance, for the purposes of testing after the AOT compile worked? 1536x1536 with no VRAM-saving techniques for quality, and --gpu-only with NoobAI EPS in ComfyUI to force an allocation over 4GB. I wasn't necessarily looking to push the limit here, but to see whether, if forced, IPEX would do the right thing. It failed with a runtime error in Intel's UR, or Unified Runtime, which is opaque to us as far as I know.

@mahiro21h

mahiro21h commented Jan 19, 2025

It failed with a runtime error in Intel's UR or Unified Runtime

i think we both ran into the same issue. i tried to generate a 1024x1024 image using the latest release of comfy without any vram optimizations or other flags, and i also got a UR error when i click again to generate an image after having already received an out of memory error. if i try to instead generate a 768x768 image, i notice my vram usage shoots up to 13.6gb so i guess there really wasn't enough vram left? also, when you tried to generate that image, it complained that it tried to allocate an absurd amount of memory but couldn't, didn't it? it tried to allocate 40gb in my case so i just switched to automatic1111

@CaoZhongZ

Use this repo instead of the original if you want to enable 4GB support for ipex-2.5 (it enables bindless kernels):

Modify here:
https://github.com/pytorch/pytorch/blob/803017f3cb73bb115eda5ec0e0a19688ccafbf4e/caffe2/CMakeLists.txt#L1058

to use this:
https://github.com/CaoZhongZ/torch-xpu-ops branch: release/ipex-2.5

Also change here:
https://github.com/pytorch/pytorch/blob/803017f3cb73bb115eda5ec0e0a19688ccafbf4e/third_party/xpu.txt#L1

to release/ipex-2.5

@CaoZhongZ

Use

export TORCH_XPU_ARCH_LIST=ats-m150

to enable AOT for the Arc A770.

@simonlui

simonlui commented Jan 22, 2025

@CaoZhongZ This does work, at least for the limited test the issue opener brought up. However, I had some major trouble trying to compile your branch of xpu-ops against the main branch of PyTorch; I had to revert back to the stable release of PyTorch v2.5.1 and still encountered issues with the output PyTorch wheel, so I cannot run something like Stable Diffusion to actually test a real application. But it seems the wheel wasn't broken too badly for me to run the test.

❯ export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
❯ python -c "import torch; print(torch.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<string>", line 1, in <module>
  File "/home/simonlui/Code_Repositories/pytorch/torch/__init__.py", line 2475, in <module>
    from torch import (
  File "/home/simonlui/Code_Repositories/pytorch/torch/export/__init__.py", line 64, in <module>
    from .dynamic_shapes import Constraint, Dim, dims, ShapesCollection
  File "/home/simonlui/Code_Repositories/pytorch/torch/export/dynamic_shapes.py", line 23, in <module>
    from .exported_program import ExportedProgram
  File "/home/simonlui/Code_Repositories/pytorch/torch/export/exported_program.py", line 26, in <module>
    from torch._higher_order_ops.utils import autograd_not_implemented
  File "/home/simonlui/Code_Repositories/pytorch/torch/_higher_order_ops/__init__.py", line 1, in <module>
    from torch._higher_order_ops.cond import cond
  File "/home/simonlui/Code_Repositories/pytorch/torch/_higher_order_ops/cond.py", line 6, in <module>
    import torch._subclasses.functional_tensor
  File "/home/simonlui/Code_Repositories/pytorch/torch/_subclasses/functional_tensor.py", line 46, in <module>
    class FunctionalTensor(torch.Tensor):
  File "/home/simonlui/Code_Repositories/pytorch/torch/_subclasses/functional_tensor.py", line 295, in FunctionalTensor
    cpu = _conversion_method_template(device=torch.device("cpu"))
/home/simonlui/Code_Repositories/pytorch/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

 (Triggered internally at /home/simonlui/Code_Repositories/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
2.5.0a0+gita8d6afb
[0]: _XpuDeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.3.30049.600000', total_memory=15473MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=32, max_work_group_size=1024, max_num_sub_groups=128, sub_group_sizes=[8 16 32], has_fp16=1, has_fp64=0, has_atomic64=1)
❯ python
Python 3.12.5 | Intel Corporation | (main, Sep  9 2024, 23:35:37) [GCC 14.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<stdin>", line 1, in <module>
  File "/home/simonlui/Code_Repositories/pytorch/torch/__init__.py", line 2475, in <module>
    from torch import (
  File "/home/simonlui/Code_Repositories/pytorch/torch/export/__init__.py", line 64, in <module>
    from .dynamic_shapes import Constraint, Dim, dims, ShapesCollection
  File "/home/simonlui/Code_Repositories/pytorch/torch/export/dynamic_shapes.py", line 23, in <module>
    from .exported_program import ExportedProgram
  File "/home/simonlui/Code_Repositories/pytorch/torch/export/exported_program.py", line 26, in <module>
    from torch._higher_order_ops.utils import autograd_not_implemented
  File "/home/simonlui/Code_Repositories/pytorch/torch/_higher_order_ops/__init__.py", line 1, in <module>
    from torch._higher_order_ops.cond import cond
  File "/home/simonlui/Code_Repositories/pytorch/torch/_higher_order_ops/cond.py", line 6, in <module>
    import torch._subclasses.functional_tensor
  File "/home/simonlui/Code_Repositories/pytorch/torch/_subclasses/functional_tensor.py", line 46, in <module>
    class FunctionalTensor(torch.Tensor):
  File "/home/simonlui/Code_Repositories/pytorch/torch/_subclasses/functional_tensor.py", line 295, in FunctionalTensor
    cpu = _conversion_method_template(device=torch.device("cpu"))
/home/simonlui/Code_Repositories/pytorch/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

 (Triggered internally at /home/simonlui/Code_Repositories/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.manual_seed(0)
<torch._C.Generator object at 0x7efd65f4ccf0>
>>> x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')
>>> y = torch.rand(47000, 47000, dtype=torch.float32, device='cpu')
>>> print(torch.mean(x).item())
0.4999975860118866
>>> print(torch.mean(y).item())
0.49999409914016724

Since the mean values from the torch.rand allocations on both the xpu and cpu devices are close to 0.5, the xpu allocation appears to have succeeded and to be working correctly.
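For context, a quick back-of-the-envelope check (my own arithmetic, not output from the session above) of why this test exceeds the old limit:

# A 47000 x 47000 float32 tensor is a single allocation well past 4 GB.
elems = 47_000 * 47_000             # ~2.209e9 float32 elements
size_gib = elems * 4 / 2**30        # 4 bytes per float32 element
print(f"{size_gib:.2f} GiB")        # ~8.23 GiB, far beyond the old 4 GB single-allocation limit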

I do have some questions about this fix, though.

1.) To use this fix, will UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1 still need to be set when it is included in a PyTorch build for any Alchemist-based GPU, or is the environment variable only a temporary measure? (See the sketch after this list for what I mean by setting it.)
2.) Since there previously seemed to be a performance penalty for this kind of allocation, which is why there was reluctance until recently to fix this, is there a performance penalty from this "bindless kernels" usage in PyTorch?
3.) What is the schedule for including this fix? I know you said it will go into the next version of IPEX, but I hope it can also land in a PyTorch nightly release soon so I can test it in Stable Diffusion.
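For concreteness, here is the kind of usage I mean in question 1 — a minimal sketch, assuming the flag only needs to be visible before the XPU runtime initializes (exporting it in the shell before launching Python would be the safer route):

import os

# Assumed usage: make the relaxed-allocation flag visible before torch/IPEX
# initialize the Level Zero / Unified Runtime backend.
os.environ["UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS"] = "1"

import torch

x = torch.rand(47000, 47000, dtype=torch.float32, device="xpu")  # ~8 GiB allocation
print(torch.mean(x).item())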

Thank you for your patience in working with me.

@CaoZhongZ

The "UR" in environment variable means "unified runtime" which is an integrated part of compiler. So, if we want to ditch the environment variable we have to update compiler, and this is something people have to wait a little bit longer.

We don't yet know the empirical performance impact of bindless on ARC. It affects the kernels that previously did not work on >4GB buffers, not the ones that were already functioning. And for diffusers and transformers, the bulk of the performance comes from the systolic arrays, so I would expect little end-to-end change.

The decision to fix the issue depends on the maturity of the toolchain and driver. Given that we have released BMG (Battlemage), which is bindless by nature in all cases, we believe the time has come.

The release schedule is something we still need to discuss, so I'll bring up the nightly-release question.

@simonlui

simonlui commented Jan 26, 2025

@CaoZhongZ So, good news and bad news. I managed to get this working on top of IPEX v2.5.10+xpu: I made a custom patch with the modifications you described, put it into the torch_patches folder, and with that build I was able to reproduce the torch.rand() test from above with the correct results. However, using more than 4GB of VRAM does not work in the test I wanted to run with Stable Diffusion. If I use ComfyUI and the default workflow from https://comfyanonymous.github.io/ComfyUI_examples/flux/ with Flux Dev in FP8 at 1024x1024, and no other VRAM-saving tricks like quantized CLIP/T5, it fails.

got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: xpu:0, offload device: cpu, dtype: torch.bfloat16
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
CLIP/text encoder model load device: xpu:0, offload device: cpu, current: cpu, dtype: torch.float32
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
loaded completely 14254.39765625 9555.0751953125 True
2025-01-25 23:51:14,199 - _logger.py - IPEX - INFO - Currently split master weight for xpu only support sgd
2025-01-25 23:51:14,207 - _logger.py - IPEX - INFO - Conv BatchNorm folding failed during the optimize process.
2025-01-25 23:51:14,212 - _logger.py - IPEX - INFO - Linear BatchNorm folding failed during the optimize process.
2025-01-25 23:51:14,212 - _logger.py - IPEX - WARNING - [NotSupported]failed to apply concat_linear on unet, please report bugs
Requested to load Flux
loaded completely 13429.94521875 11350.067443847656 True
2025-01-25 23:51:20,690 - _logger.py - IPEX - INFO - Currently split master weight for xpu only support sgd
2025-01-25 23:51:20,698 - _logger.py - IPEX - INFO - Conv BatchNorm folding failed during the optimize process.
2025-01-25 23:51:20,704 - _logger.py - IPEX - INFO - Linear BatchNorm folding failed during the optimize process.
2025-01-25 23:51:20,705 - _logger.py - IPEX - WARNING - [NotSupported]failed to apply concat_linear on unet, please report bugs
  0%|                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 0/20 [00:03<?, ?it/s]
!!! Exception during processing !!! UR error
Traceback (most recent call last):
  File "/home/simonlui/Code_Repositories/ComfyUI/execution.py", line 327, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/execution.py", line 202, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/execution.py", line 174, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/home/simonlui/Code_Repositories/ComfyUI/execution.py", line 163, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 651, in sample
    samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 984, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/patcher_extension.py", line 110, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 952, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 935, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/patcher_extension.py", line 110, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 714, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/.conda/envs/comfyui-test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/k_diffusion/sampling.py", line 161, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 379, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 915, in __call__
    return self.predict_noise(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 918, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 359, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 195, in calc_cond_batch
    return executor.execute(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/patcher_extension.py", line 110, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/samplers.py", line 308, in _calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/model_base.py", line 131, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/patcher_extension.py", line 110, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/model_base.py", line 162, in _apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/.conda/envs/comfyui-test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/.conda/envs/comfyui-test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/ldm/flux/model.py", line 203, in forward
    out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, transformer_options, attn_mask=kwargs.get("attention_mask", None))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/ldm/flux/model.py", line 142, in forward_orig
    img, txt = block(img=img,
               ^^^^^^^^^^^^^^
  File "/home/simonlui/.conda/envs/comfyui-test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/.conda/envs/comfyui-test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/ldm/flux/layers.py", line 174, in forward
    attn = attention(torch.cat((txt_q, img_q), dim=2),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simonlui/Code_Repositories/ComfyUI/comfy/ldm/flux/math.py", line 13, in attention
    q = q.float().reshape(*q.shape[:-1], -1, 1, 2)
        ^^^^^^^^^
RuntimeError: UR error

Prompt executed in 34.05 seconds

I can open a separate issue for this if needed, but it seems there may still be problems in the UR runtime that make >4GB VRAM allocations impractical in practice, even though they technically work.
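If it helps with triage, here is a hedged minimal-repro idea of my own (the shapes are made up for illustration, not taken from the log above): the traceback dies at q.float().reshape(...), so running just that pattern on a tensor whose float32 copy exceeds 4 GB might show whether the UR error is a runtime allocation limit or something ComfyUI-specific.

import torch

q = torch.rand(2, 24, 9000, 5120, dtype=torch.float16, device="xpu")  # ~4.4 GB as fp16
q32 = q.float()                                   # ~8.8 GB fp32 copy, i.e. > 4 GB
q32 = q32.reshape(*q32.shape[:-1], -1, 1, 2)      # same pattern as comfy/ldm/flux/math.py
print(q32.shape)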

@kodiconnect

@ALL: Hi,
I want to thank you all for your goodwill and effort on this.
@mahiro21h Thanks for your hint.
A question for everyone: could somebody please give a rule of thumb on whether this will solve my problem when using an IPEX-style option in ComfyUI (as in e.g. SDNext/Automatic)?
Put another way: will I be able to use more than 4GB of my Arc A770's VRAM for somewhat bigger models in ComfyUI?
Thanks and regards,
Roger

I had success with this:

ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

Although I'm sure it doesn't address the core issue.
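(The ENV line above is Dockerfile syntax. A rough equivalent outside Docker — a sketch, assuming the allocator reads the variable when it initializes, and it is unclear whether the XPU build honors this CUDA-allocator knob at all — would be to set it before importing torch:)

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # assumed: read at allocator init
import torch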
