Low Mixed Precision Performance #296
I did some more digging around. It doesn't seem to be a problem with mixed precision per se, but rather a performance problem in general. Using the profiler for FP32, I found that two operators dominate the runtime. But if we compare again with a Tesla T4 on Colab, equivalent operators are faster on the A770, except for those two problematic ones. Out of curiosity, I tried to run the same code on CPU and it took only 7.5s (compared to 5.75s on XPU), but the most surprising part is that ...
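(For reference, a minimal sketch of how a per-operator breakdown like this can be collected with the stock PyTorch profiler. The model and batch below are stand-ins; the numbers in this thread come from the CIFAR-10 example. Only the host-side view is shown here; device-side XPU timings may need IPEX's own profiling support.)

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
from torch.profiler import profile, ProfilerActivity

# Stand-in model and batch, just to have something to profile.
model = torch.nn.Conv2d(3, 64, kernel_size=3).to("xpu")
batch = torch.randn(128, 3, 32, 32)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    out = model(batch.to("xpu"))
    torch.xpu.synchronize()  # make sure queued XPU work is included

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```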
Hi @fredlarochelle, thanks for reporting. I have already faced this issue. We are working on it.
Just checked, it has nothing to do with synchronize(). Each data batch copy from CPU to XPU is taking a significant amount of time, due to aten::copy_.
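(The actual measurement isn't shown in the thread as exported; a minimal sketch of timing a single batch copy, using an arbitrary CIFAR-10-sized batch:)

```python
import time
import torch
import intel_extension_for_pytorch as ipex

batch = torch.randn(128, 3, 32, 32)        # a CIFAR-10-sized batch on the host
size_mb = batch.nelement() * batch.element_size() / 1e6

torch.xpu.synchronize()
t0 = time.time()
batch_xpu = batch.to("xpu")
torch.xpu.synchronize()                    # wait for the copy to actually finish
print(f"host -> xpu: {(time.time() - t0) * 1e3:.2f} ms for {size_mb:.1f} MB")
```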
That sample code is just meant to show usage of the IPEX APIs; it isn't meant for performance comparison.
As @fredlarochelle said, aten::copy_ is extremely slow. The bottleneck seems to be GPU-to-GPU transfer. Example code testing GPU-to-GPU memory copy:
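(The original snippet wasn't preserved in this export; below is a minimal sketch of what such a device-to-device copy benchmark could look like. The dtype, sizes, and iteration counts are illustrative.)

```python
import time
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

def d2d_bandwidth_gbs(num_bytes, dtype=torch.bfloat16, iters=20):
    elem = torch.finfo(dtype).bits // 8
    src = torch.empty(num_bytes // elem, dtype=dtype, device="xpu")
    dst = torch.empty_like(src)
    torch.xpu.synchronize()
    t0 = time.time()
    for _ in range(iters):
        dst.copy_(src)                      # device-to-device copy
    torch.xpu.synchronize()
    return num_bytes * iters / (time.time() - t0) / 1e9

for size in (1 << 20, 1 << 24, 1 << 28, 1 << 30):   # 1 MiB .. 1 GiB
    print(f"{size / 1e9:6.3f} GB: {d2d_bandwidth_gbs(size):6.1f} GB/s")
```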
It tops out at a maximum transfer rate of about 100 GB/s, while the A770's bandwidth is 512 GB/s. More importantly, transferring small amounts of data is extremely slow: small batches only achieve about 30 GB/s, which is 6% of the theoretical maximum bandwidth of 512 GB/s.
Also, matrix multiplication seems to be memory-bandwidth limited.
Matrix-multiplication TFLOPS increases steadily up to a 30000 x 30000 matrix and then suddenly drops. I am guessing matrix multiplication is limited by memory bandwidth and that there are cache misses when fitting large matrices into VRAM.
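(The benchmark script itself didn't survive the export; this is only a sketch of what such a TFLOPS sweep might look like, with illustrative sizes. Note that it creates the matrices on the host and copies them over inside the timed loop, which the next comment picks up on.)

```python
import time
import torch
import intel_extension_for_pytorch as ipex

def matmul_tflops(n, dtype=torch.bfloat16, iters=5):
    torch.xpu.synchronize()
    t0 = time.time()
    for _ in range(iters):
        # Host allocation plus the host-to-device copy are inside the timed
        # region here, which is exactly what the next comment suggests avoiding.
        a = torch.randn(n, n, dtype=dtype).to("xpu")
        b = torch.randn(n, n, dtype=dtype).to("xpu")
        c = a @ b
    torch.xpu.synchronize()
    seconds = (time.time() - t0) / iters
    return 2 * n ** 3 / seconds / 1e12      # one n x n matmul is ~2*n^3 FLOPs

for n in (1024, 2048, 4096, 8192, 16384, 30000):
    print(f"{n:>5} x {n:<5}: {matmul_tflops(n):7.2f} TFLOPS")
```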
@BA8F0D39 Quick tip: when possible, creating the matrices directly on the XPU will make everything run way faster.
For me, with that change, your TFLOPS program runs about 30x faster.
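For reference, the change being suggested is to allocate on the device instead of creating the tensor on the host and copying it over; something like:

```python
import torch
import intel_extension_for_pytorch as ipex

n = 8192  # illustrative size

# Created on the host and then copied to the GPU (the slow path discussed above):
a_slow = torch.randn(n, n, dtype=torch.bfloat16).to("xpu")

# Created directly on the GPU:
a_fast = torch.randn(n, n, dtype=torch.bfloat16, device="xpu")
```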
@fredlarochelle Also, does your A770 GPU allow you to allocate more than 8 GB of bf16 memory? My code crashes on the A770 if I try to allocate bf16 memory above 8 GB. I have an A770 16 GB card.
No, like with CUDA, ... And I can also confirm that I can't seem to allocate over 8 GB at all on the A770 16 GB (with both fp32 and bf16); I get the following error: ...
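(The error text itself wasn't preserved in this thread. For reference, the kind of allocation being discussed is a single tensor larger than 8 GB; a minimal illustration, with an arbitrary size:)

```python
import torch
import intel_extension_for_pytorch as ipex

# 5 * 2^30 bf16 elements * 2 bytes each = 10 GiB in a single tensor,
# which reportedly fails on a 16 GB A770.
x = torch.empty(5 * 1024**3, dtype=torch.bfloat16, device="xpu")
```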
@jingxu10
No, PyTorch does not. It should be a driver limit.
It seems that some of the operations are using a slow code path. If channels_last is not enabled, convolution forward and convolution weight backward do not use blocked formats and fall back to a slow version (see dnnl_normal.log).
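(A minimal sketch of opting a model and its input into channels_last so the convolutions can take the faster layout path; the model and shapes here are illustrative, not the ones from the log.)

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1),
).to("xpu").to(memory_format=torch.channels_last)   # NHWC weights

x = torch.randn(64, 3, 32, 32, device="xpu").to(memory_format=torch.channels_last)
y = model(x)
torch.xpu.synchronize()
```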
@lit199
The kernel latency of 34.76 us on the A770 16 GB is 10x larger than the 3.46 us of an RTX 2080 SUPER.
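(How such a number is obtained isn't shown here. One rough way is to queue many trivial kernels back to back and average, which mixes launch overhead with queueing, so treat the result as an approximation rather than a true single-kernel latency:)

```python
import time
import torch
import intel_extension_for_pytorch as ipex

x = torch.ones(1, device="xpu")   # tiny tensor so kernel execution time is negligible
torch.xpu.synchronize()

iters = 10000
t0 = time.time()
for _ in range(iters):
    x.add_(1.0)                   # one trivial kernel per iteration
torch.xpu.synchronize()
print(f"~{(time.time() - t0) / iters * 1e6:.2f} us per kernel launch (amortized)")
```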
@BA8F0D39 A slow kernel launch is bad, but it is not the main culprit here. IPEX/PyTorch/oneDNN is choosing a slower version when a faster one is available. Also, I am getting 8.97 us.
@jingxu10 Setting IPEX_XPU_ONEDNN_LAYOUT=1 seems to add the necessary reorder steps. I am getting 4-5 batches/s after this change. Here is some oneDNN log after the change.
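(The oneDNN log itself didn't survive the export. For anyone trying to reproduce the change, it is just an environment variable; the assumption below is that it must be set before IPEX initializes, i.e. before the import. Setting it in the shell before launching the script also works.)

```python
import os

# Assumption: IPEX_XPU_ONEDNN_LAYOUT is read when intel_extension_for_pytorch
# initializes, so set it before the import (or export it in the shell).
os.environ["IPEX_XPU_ONEDNN_LAYOUT"] = "1"

import torch
import intel_extension_for_pytorch as ipex
```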
@lit199
Any updates on the slow GPU-to-GPU transfer speed? It's way better than it was, but it's still 3-5x slower than a Tesla T4 on Colab for transfers smaller than around 0.01-0.12 GB. Above that, the A770's performance is around where it should be: a bit under double the speed of the T4, which matches the ratio of the two GPUs' theoretical memory bandwidths. Performance is also significantly worse on the A770 for a single transfer. Finally, comparing FP16/BF16 with FP32 on the A770, FP32 reaches about half the bandwidth, except for small transfers where it's about the same. Also, for the weird 4GB/8GB memory issues, I have further isolated the problem. I don't have the ...
Take note that each "section of code" was run in a restarted kernel. For the gibberish with arrays over 4 GB, see #325.
@BA8F0D39 @fredlarochelle We verified your case locally and got a similar memory bandwidth for BF16, ~100 GB/s. We also tried Float and got ~200 GB/s, while clpeak seems to show 400 GB/s. The issue might be that it is instruction-bound: the memory-access instructions cannot feed DDR/HBM well. We are checking the whole stack for SIMD enabling on Arc.
@arthuryuan1987 I get a higher bandwidth than that for BF16 (same for FP16): ~200 GB/s (peak of 222 GB/s) for transfers over ~0.2 GB (below that it's lower, and drastically lower below ~0.01-0.12 GB). On my system, it's FP32 that maxes out at ~110 GB/s. For FP16 and BF16 with bigger transfers, the percentage of the theoretical max bandwidth I get is about the same as on a Tesla T4 on Colab. The problems are really with smaller transfers and with FP32. Also, single transfers are really slow.
@fredlarochelle
@arthuryuan1987 |
@BA8F0D39 Can confirm, I get around 1s per epoch too, but that's still way better than it was in February. The performance is also definitely better with bigger datasets/bigger memory transfers and with some optimization (this simple example in the doc is not optimized at all), but it is probably still a bit under an order of magnitude off for this particular example. The Tesla T4's theoretical specs are 65.13 TFLOPS at FP16 with 320 GB/s of memory bandwidth vs. the A770's 157.23 TFLOPS for FP16/BF16 (65536 ops per clock cycle) and 560 GB/s. So it could possibly be even more than 2x a T4. For multiple workers, I have no problem with setting ...
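(The specific setting is cut off above. Assuming it refers to the DataLoader worker count, a generic multi-worker input pipeline looks like the following, with hypothetical values and a dummy dataset standing in for CIFAR-10:)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import intel_extension_for_pytorch as ipex

# Dummy CIFAR-10-shaped dataset standing in for the real one.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=2)

for images, labels in loader:
    images, labels = images.to("xpu"), labels.to("xpu")
```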
@BA8F0D39 |
I suppose you are suggesting using multi-threading to feed the memory port. Your intention is right. To keep the memory port well fed, ...
Regarding the cases here, your approach should not work, ...
@fredlarochelle @arthuryuan1987
On A770 16 GB
I am encountering some strange performance behavior on the A770. For example, take the CIFAR-10 example in the documentation.
Using FP32, I get around 5.75s per epoch and, using BF16, around 6.2s per epoch. I also get exactly the same performance with and without ipex.optimize(). Also, when I compare with a Tesla T4 on Colab, it runs each epoch in around 1s in FP32 and around 0.25s in FP16. Way faster, and the A770 technically has better specs...
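(For reference, a rough sketch of the kind of BF16 training loop the documentation example uses, not the exact snippet; the ResNet-50/CIFAR-10 combination, transforms, and hyperparameters here are illustrative.)

```python
import torch
import torchvision
import intel_extension_for_pytorch as ipex

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model.train()

model = model.to("xpu")
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

for data, target in train_loader:
    data, target = data.to("xpu"), target.to("xpu")
    optimizer.zero_grad()
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)
    loss.backward()
    optimizer.step()
```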
Are the XMX engines being used on Arc GPUs (see #258)?
Dunno if it might be related, but I get the following warnings when running the example (EDIT: I started going through the code in the repo; those warnings are not related to the current issue): ...
Ubuntu 22.04 with 1.13.10+xpu.