posts/intel-pytorch-extension-tutorial/native-ubuntu/ #38
The Mamba part doesn't work for me; it does not recognize `mamba`.
Hi @Danyal-sab, Thanks for pointing that out. I forgot to include the line that initializes Mamba after running the Mambaforge install script. I updated the post with the missing line. You can run the following commands to initialize Mamba and relaunch the current bash shell to apply the changes:

```bash
~/mambaforge/bin/mamba init
bash
```

I saw you posted another comment earlier, but it got deleted before I could respond. Did you resolve your previous issue?
Hi @cj-mills, Thanks for your speedy reply, and thanks for your help. Yes, I posted about a minor issue with executing the commands in the "Install oneAPI Base Toolkit" section, where the second line gave a typing error. I resolved it by simply removing the backslashes (`\`) at the end of the lines, after which it worked correctly. That's why I deleted the post. Thanks again for your great help.
Hi @cj-mills, Is there any tool for monitoring the Arc GPU's memory usage (like `nvidia-smi` for NVIDIA)? I checked a few tools, such as `intel_gpu_top`, Intel VTune, and Intel GPA, but they either weren't compatible with Ubuntu 23.04 or don't offer GPU memory usage monitoring. Are there other tools we could possibly use?
Hi @Danyal-sab, The only one I know of is the `sysmon` tool. Unfortunately, you would need to compile the tool from the source code. Also, it does not seem fully functional on my system, as it does not show any running processes:

```text
$ sudo sysmon
=====================================================================================
GPU 0: Intel(R) Arc(TM) A770 Graphics    PCI Bus: 0000:03:00.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26241    Subdevices: 0
EU Count: 512    Threads Per EU: 8    EU SIMD Width: 8    Total Memory(MB): 15473.6
Core Frequency(MHz): 2000.0 of 2400.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
=====================================================================================
GPU 1: Intel(R) UHD Graphics 750    PCI Bus: 0000:00:02.0
Vendor: Intel(R) Corporation    Driver Version: 1.3.26241    Subdevices: 0
EU Count: 32    Threads Per EU: 7    EU SIMD Width: 8    Total Memory(MB): 25360.9
Core Frequency(MHz): 350.0 of 1300.0    Core Temperature(C): unknown
=====================================================================================
Running Processes: unknown
```
Great tutorial that helped me a lot in setting up an environment for ML/DL with an Arc GPU. It really saved my life, and I hope to read more excellent material of this kind. Thanks again, much appreciated.
I really appreciate your work! I am trying to set the GPU up for the scikit-learn monkey-patch (https://github.com/intel/scikit-learn-intelex), but I am struggling to go beyond CPU acceleration. I have no idea how to 1. list the devices and 2. point to a specific device. Do you have any experience with that?
Hi @psmgeelen, I have not tried Intel's scikit-learn extension, so I don't know if it even supports Arc GPUs. The DPC++ compiler runtime does support Arc GPUs, meaning it should work in theory. Have you tried the example code for performing computations on the GPU in the extension's documentation? Based on the example code, the Arc GPU should be the `"gpu:0"` device, assuming it is the only discrete GPU installed on the system. The integrated graphics should be the `"gpu:1"` device.
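For anyone who lands here, a minimal sketch of both steps, assuming `dpctl` and `scikit-learn-intelex` are installed and following the patching/offload pattern from the extension's docs (the `KMeans` workload is just a placeholder):

```python
import dpctl
import numpy as np
from sklearnex import patch_sklearn, config_context

# 1. List the SYCL devices the DPC++ runtime can see.
for device in dpctl.get_devices():
    print(device)

# 2. Apply the monkey-patch, then point a computation at a specific device.
patch_sklearn()
from sklearn.cluster import KMeans  # import after patching

X = np.random.rand(1000, 8).astype(np.float32)
with config_context(target_offload="gpu:0"):  # the discrete Arc GPU, per above
    KMeans(n_clusters=4, n_init=10).fit(X)
```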
Hi @cj-mills, Then I started running heavier code close to the limits of the A770, and the crashes stopped for the day.
@cj-mills, I have, and it's not finding the device for whatever reason. I created a ticket at intelex here: uxlfoundation/scikit-learn-intelex#1357 (comment)
Hi @cj-mills,
@Danyal-sab I have not tested the YOLOX training code with the previous extension version because the code requires torchvision 0.15+ (which requires PyTorch 2.0+). I updated the tutorial because everything I tested that worked with the previous extension version still works with the new version, and the current Ubuntu LTS now ships with a kernel that supports Arc GPUs.
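For anyone following along, a quick way to confirm an environment meets those version requirements:

```python
import torch
import torchvision

# The YOLOX training code needs PyTorch 2.0+ and torchvision 0.15+.
print(f"torch: {torch.__version__}, torchvision: {torchvision.__version__}")
```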
@cj-mills, `from torchtnt.utils import get_module_summary` Thanks again for your great help.
@Danyal-sab
@cj-mills,
@Danyal-sab It sounds like a performance difference similar to not having the `IPEX_XPU_ONEDNN_LAYOUT` environment variable set. I don't know whether that's related to your issue, but maybe try setting that environment variable to 0 and then 1 to see if it impacts performance. It might also just be a bad driver update. Can you roll back to the previous driver version?
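A minimal sketch of that comparison, assuming the variable needs to be set before `intel_extension_for_pytorch` is imported:

```python
import os

# Set to "1" for one run and "0" for another to compare training performance.
os.environ["IPEX_XPU_ONEDNN_LAYOUT"] = "1"

import torch
import intel_extension_for_pytorch as ipex  # reads the variable at import time
```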
@cj-mills,
@Danyal-sab, I briefly swapped in the Arc card a couple of weeks ago, and the training notebooks that worked with the previous versions no longer produced usable models. It was the same issue I described here, but it occurred even with the baseline image classification notebook. I believe I tried Python 3.9, 3.10, and 3.11, and I had the same issue with all of them. I did not have time to investigate, so I held off on making a post about it.
@cj-mills,
@Danyal-sab,
@cj-mills,
@cj-mills,
Hi @cj-mills and everyone,
But it does not make things better. Thanks in advance!
Hi @contryboy, I have not had a chance to investigate the source of the issue, but I plan to give it another shot when the next release comes out.
Hi @cj-mills, [1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32
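For context, the float32 example linked in [1] follows this general pattern (a minimal sketch; the model and optimizer here are placeholders):

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model = model.to("xpu")
model.train()
# ipex.optimize applies operator and layout optimizations for the XPU device.
model, optimizer = ipex.optimize(model, optimizer=optimizer)
```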
@contryboy Nice! It would certainly be more convenient for me if they resolved the issue in the next release.
Hi @cj-mills,
@Danyal-sab, I will when I have enough time to swap my Arc card into my desktop and test the latest version. I've been too busy with work projects lately to swap out my NVIDIA card.
Many thanks @cj-mills,
Hello, it's great to see that the PyTorch 2.5 preview has been released with native XPU support.
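A minimal sketch of what the native support looks like with the 2.5 preview (no `intel_extension_for_pytorch` import required, assuming an XPU-enabled build):

```python
import torch

# torch.xpu mirrors the familiar torch.cuda device API.
if torch.xpu.is_available():
    x = torch.rand(1024, 1024, device="xpu")
    print(x.device, torch.xpu.get_device_name(0))
```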
Hi @cj-mills,
Have you encountered the 4GB issue noted in intel/intel-extension-for-pytorch#325? I cannot use the bigger models due to this issue, even though the Arc A770 has 16 GB.
I have not encountered the 4GB issue, but that might just be a matter of which models I've tested. I have not tried to replicate the issue on my card. I can try when I have some time.
Hey @vampireLibrarianMonk, I finally had time to set up a separate computer with the A770 and run the test case in the GitHub issue you linked to.

Running the test case using Intel's PyTorch extension produces the following:

```python
x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')
```

Using the preview version:

```python
x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')
x = torch.rand(33000, 32600, dtype=torch.float32, device='xpu')
```

```text
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[4], line 1
----> 1 x = torch.rand(33000, 32600, dtype=torch.float32, device='xpu')

OutOfMemoryError: XPU out of memory. Tried to allocate 4.01 GiB. GPU 0 has a total capacity of 15.11 GiB. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
```
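As the error message suggests, cached allocations can be released between attempts. A small sketch (this frees cached blocks, but it would not lift a per-allocation ceiling if that is the underlying bug):

```python
import torch

# Release unoccupied cached memory held by the XPU allocator.
torch.xpu.empty_cache()
```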
This issue is currently being worked on via intel/intel-extension-for-pytorch#325. Updates to the discussion are happening at least weekly, but there is no concrete solution yet.
Christian Mills - Getting Started with Intel’s PyTorch Extension for Arc GPUs on Ubuntu
This tutorial provides a step-by-step guide to setting up Intel’s PyTorch extension on Ubuntu to train models with Arc GPUs.
https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/