
Basic support for Intel XPU (Arc Graphics) #409

Merged
merged 6 commits into comfyanonymous:master on Apr 7, 2023

Conversation

kwaa (Contributor) commented Apr 5, 2023

Closes #387

Note:

You need to install the oneAPI Base Toolkit first; on Arch Linux: paru -S intel-oneapi-basekit intel-compute-runtime-bin

Use the AUR's intel-compute-runtime-bin instead of intel-compute-runtime to avoid the Assertion '__n < this->size()' failed error.

It may also require the oneAPI AI Analytics Toolkit; I'm not sure.

Then run:

# setvars
source /opt/intel/oneapi/setvars.sh
# install dependencies
pip install torch==1.13.0a0 torchvision==0.14.1a0 intel_extension_for_pytorch==1.13.10+xpu -f https://developer.intel.com/ipex-whl-stable-xpu
pip install -r requirements.txt
# run
python main.py --use-split-cross-attention

If --use-split-cross-attention is not used, the output is noise.

Verify XPU availability from the ComfyUI folder where the dependencies were installed:

[user@host ComfyUI]$ python
>>> import torch
>>> import intel_extension_for_pytorch
>>> torch.xpu.is_available()
True
>>> torch.xpu.get_device_properties('xpu')
_DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)

Known issues:

  • Noise occurs when the batch size is larger than one, even with --use-split-cross-attention
    • There is about a 50% probability of generating noise
    • Without --use-split-cross-attention, it is 100%
  • Some samplers and schedulers cannot be used

@kwaa kwaa changed the title Basic support for Intel XPU Basic support for Intel XPU (Arc Graphics) Apr 5, 2023
kwaa (Contributor, Author) commented Apr 5, 2023

PS: If torch.xpu.optimize is used on the model, it hits the same error as intel/intel-extension-for-pytorch#319:

elif vram_state == XPU:
+   real_model = torch.xpu.optimize(real_model)
    real_model.to("xpu")
    pass

comfyanonymous (Owner) commented:

By default should XPU have priority over CUDA if both are present?

kwaa (Contributor, Author) commented Apr 5, 2023

By default should XPU have priority over CUDA if both are present?

It first checks whether intel_extension_for_pytorch is installed; if it is not, CUDA is used.
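A minimal sketch of that selection logic (illustrative only; not the exact ComfyUI code, and the helper name here is made up):

import torch

try:
    import intel_extension_for_pytorch  # noqa: F401  (registers torch.xpu)
    xpu_available = torch.xpu.is_available()
except ImportError:
    xpu_available = False

def pick_torch_device():
    # Prefer XPU when IPEX is installed and a device is present,
    # otherwise fall back to CUDA, then CPU.
    if xpu_available:
        return torch.device("xpu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")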

comfyanonymous (Owner) commented:

I'm asking because some computers have an integrated Intel GPU alongside an Nvidia GPU, and I'm wondering if that could cause issues.

kwaa (Contributor, Author) commented Apr 5, 2023

I'm asking because some computers have an integrated Intel GPU alongside an Nvidia GPU, and I'm wondering if that could cause issues.

Probably not, because PyTorch does not support XPU by default; you need to install intel_extension_for_pytorch, and installing it can be taken as intent to use the XPU first.

kwaa (Contributor, Author) commented Apr 6, 2023

@comfyanonymous Is it possible to merge this PR, or is there something else wrong with it?

comfy/model_management.py (review thread: outdated, resolved)
kwaa (Contributor, Author) commented Apr 6, 2023

Update: I noticed the Experimental Codeless Optimization (ipexrun), but it seems to be CPU-only for now.

Manually adding torch.xpu.optimize, torch.xpu.amp.autocast, torch.xpu.empty_cache, etc. would be very cumbersome and hard to maintain, so ipexrun seems like a good solution.
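For context, a hedged sketch of what that manual integration would look like at each call site (illustrative only; it assumes an IPEX XPU build where these helpers exist and an XPU device is present):

import torch
import intel_extension_for_pytorch  # noqa: F401  (provides torch.xpu)

model = torch.nn.Linear(16, 16).to("xpu").eval()
model = torch.xpu.optimize(model)     # per-model optimization call
x = torch.randn(1, 16, device="xpu")

with torch.xpu.amp.autocast():        # per-inference autocast context
    y = model(x)

torch.xpu.empty_cache()               # explicit cache management afterwards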

Waiting for v2.0.0+xpu to be released...

comfyanonymous (Owner) commented:

After looking at this pull request a bit, I don't think XPU should be treated as another VRAM state. It makes sense for CPU and MPS because they don't have any VRAM, but XPU actually has VRAM, so it would be good if --lowvram and --highvram worked.

kwaa (Contributor, Author) commented Apr 6, 2023

After looking at this pull request a bit, I don't think XPU should be treated as another VRAM state. It makes sense for CPU and MPS because they don't have any VRAM, but XPU actually has VRAM, so it would be good if --lowvram and --highvram worked.

That may require deeper changes, but I do agree.
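A minimal sketch of how the XPU's reported memory could feed such a VRAM-state decision (illustrative only; the threshold is arbitrary and total_memory is assumed to be in bytes, as it is for CUDA):

import torch
import intel_extension_for_pytorch  # noqa: F401

props = torch.xpu.get_device_properties("xpu")
total_mb = props.total_memory // (1024 * 1024)  # assumed bytes

# e.g. pick a low-VRAM code path automatically on small cards
vram_state = "LOW_VRAM" if total_mb < 4096 else "NORMAL_VRAM"
print(vram_state, total_mb, "MB")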

@kwaa kwaa requested a review from comfyanonymous April 6, 2023 06:25
kwaa (Contributor, Author) commented Apr 6, 2023

Ah, there are some problems with --lowvram, please wait.

It looks related to this issue: #39
After running pip install accelerate --upgrade, everything works fine.

Tested:

  • LOW_VRAM ✅
  • NORMAL_VRAM ✅
  • HIGH_VRAM ✅
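For reference, the commands these modes correspond to (the run without a VRAM flag presumably being the NORMAL_VRAM default):

python main.py --use-split-cross-attention --lowvram
python main.py --use-split-cross-attention
python main.py --use-split-cross-attention --highvram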

@kwaa kwaa requested a review from comfyanonymous April 7, 2023 01:12
@comfyanonymous comfyanonymous merged commit 28a7205 into comfyanonymous:master Apr 7, 2023
comfyanonymous (Owner) commented:

I did a small refactor, so if you can confirm it still works, that would be great: bceccca

kwaa (Contributor, Author) commented Apr 7, 2023

I did a small refactor, so if you can confirm it still works, that would be great: bceccca

No problem, it still works.

kotx commented Apr 16, 2023

Hi, I am unable to get any sort of generation working on my A770 by following the instructions in the first post:

Traceback (most recent call last):
  File "/home/kot/Documents/ComfyUI/execution.py", line 184, in execute
    executed += recursive_execute(self.server, prompt, self.outputs, x, extra_data)
  File "/home/kot/Documents/ComfyUI/execution.py", line 60, in recursive_execute
    executed += recursive_execute(server, prompt, outputs, input_unique_id, extra_data)
  File "/home/kot/Documents/ComfyUI/execution.py", line 60, in recursive_execute
    executed += recursive_execute(server, prompt, outputs, input_unique_id, extra_data)
  File "/home/kot/Documents/ComfyUI/execution.py", line 69, in recursive_execute
    outputs[unique_id] = getattr(obj, obj.FUNCTION)(**input_data_all)
  File "/home/kot/Documents/ComfyUI/nodes.py", line 768, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
  File "/home/kot/Documents/ComfyUI/nodes.py", line 699, in common_ksampler
    comfy.model_management.load_model_gpu(model)
  File "/home/kot/Documents/ComfyUI/comfy/model_management.py", line 168, in load_model_gpu
    real_model.to(get_torch_device())
  File "/home/kot/Documents/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in to
    return self._apply(convert)
  File "/home/kot/Documents/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/kot/Documents/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/kot/Documents/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/kot/Documents/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/home/kot/Documents/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: Native API failed. Native API returns: -997 (The plugin has emitted a backend specific error) -997 (The plugin has emitted a backend specific error)

A quick Google search shows me this issue: intel/compute-runtime#617, so it seems it's failing to allocate more than 8GB VRAM.
Is there any fix for it? I'm on Manjaro with kernel 6.1.23-1, libva-intel-driver 2.4.1-2 (unsure if it is relevant).
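One hypothetical way to check whether the per-allocation limit is the culprit (illustrative only, not from the original report; the error type may differ):

import torch
import intel_extension_for_pytorch  # noqa: F401

# Try a single ~10 GiB float32 allocation on the XPU.
n = 10 * 1024**3 // 4
try:
    t = torch.empty(n, dtype=torch.float32, device="xpu")
    print("allocated", t.numel() * 4 / 1024**3, "GiB")
except RuntimeError as e:
    print("allocation failed:", e)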

kwaa (Contributor, Author) commented Apr 17, 2023

Hi, I am unable to get any sort of generation working on my A770 by following the instructions in the first post:

A quick Google search shows me this issue: intel/compute-runtime#617, so it seems it's failing to allocate more than 8GB VRAM. Is there any fix for it? I'm on Manjaro with kernel 6.1.23-1, libva-intel-driver 2.4.1-2 (unsure if it is relevant).

I have not encountered this problem. Are you using AUR's intel-compute-runtime-bin?

kotx commented Apr 17, 2023

Are you using AUR's intel-compute-runtime-bin?

Yes, I used the paru command in the first post.

kwaa (Contributor, Author) commented Apr 17, 2023

Yes, I used the paru command in the first post.

Perhaps you could try installing the oneAPI AI Kit; see #476

Or LOW_VRAM mode: python main.py --use-split-cross-attention --lowvram

kotx commented Apr 18, 2023

Sadly, no luck with the AI Kit or --lowvram. Does it matter that I'm using my own Python instead of the Intel Python included in the AI Kit?

kwaa (Contributor, Author) commented Apr 18, 2023

Sadly, no luck with the AI Kit or --lowvram. Does it matter that I'm using my own Python instead of the Intel Python included in the AI Kit?

If you are not using the Intel Python that comes with the AI Kit, installing the Kit will make no noticeable difference.

kotx commented Apr 19, 2023

If you are not using the Intel Python that comes with the AI Kit, installing the Kit will make no noticeable difference.

I recreated the venv with Intel Python:

(venv) [kot@rin ComfyUI]$ python -V
Python 3.9.15 :: Intel Corporation

But I get the same error as before. Model is Counterfeit v2.5.

kwaa (Contributor, Author) commented Apr 19, 2023

But I get the same error as before. Model is Counterfeit v2.5.

Hmm... this is a bit tricky. Can you post the output of the sycl-ls command, and of torch.xpu.is_available() and torch.xpu.get_device_properties('xpu') in Python?

kotx commented Apr 20, 2023

sycl-ls:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.15.12.0.01_081451]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 5 3600 6-Core Processor               3.0 [2022.15.12.0.01_081451]
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.05.25593.11]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.25593]

Python:

Python 3.9.15 (main, Nov 11 2022, 13:58:57) 
[GCC 11.2.0] :: Intel Corporation on linux
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
>>> import torch
>>> import intel_extension_for_pytorch
[W OperatorEntry.cpp:150] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: torchvision::nms
    no debug info
  dispatch key: CPU
  previous kernel: registered at /build/intel-pytorch-extension/csrc/cpu/aten/TorchVisionNms.cpp:47
       new kernel: registered at /opt/workspace/vision/torchvision/csrc/ops/cpu/nms_kernel.cpp:112 (function registerKernel)
/home/kot/Documents/ComfyUI/venv/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
>>> torch.xpu.is_available()
True
>>> torch.xpu.get_device_properties('xpu')
_DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
>>> 

kwaa (Contributor, Author) commented Apr 21, 2023

That looks normal; I probably don't have a proper workaround.

But I didn't get this warning:

/home/kot/Documents/ComfyUI/venv/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
   warn(f"Failed to load image Python extension: {e}")


Successfully merging this pull request may close these issues.

[Feature Request] Support Intel Extension for PyTorch (IPEX)