Printing a float32 takes 1340 µs in IPEX. This is fine.
However, transferring a single float32 number takes 0.142 s on an Intel Arc A770 16 GB. Why does this take so long? The effective GPU-to-GPU transfer rate works out to 224.56 bit/s for 1 float32.
For reference, an RTX 3090 takes 0.000359 s to transfer a single float32 number.
import time
import torch
import torchvision.models as models
import numpy as np
import intel_extension_for_pytorch as ipex
torch.manual_seed(0)
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
torch.xpu.synchronize()
start = time.time()
print(x.cpu())  # device-to-host copy plus print
end = time.time()
print("Print Time in Seconds: %.20f " % (end - start))
torch.manual_seed(2)
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
y = torch.rand(1, 1, dtype=torch.float32, device='xpu')
torch.xpu.synchronize()
start = time.time()
y = x.clone()   # device-to-device copy on the XPU
print(y.cpu())  # device-to-host copy plus print
end = time.time()
print("Data Transfer Time in Seconds: %.20f " % (end - start))
PyTorch takes 0.142 s to issue 1 command on an Intel Arc A770 16 GB
tensor([[0.9179]])
Print Time in Seconds: 0.00134086608886718750
tensor([[0.9696]])
Data Transfer Time in Seconds: 0.14255475997924804688
PyTorch takes 0.000359 s to issue 1 command on an RTX 3090
tensor([[0.3990]])
Print Time in Seconds: 0.00103116035461425781
tensor([[0.4254]])
Data Transfer Time in Seconds: 0.00035905838012695312
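For context, the transfer-rate figure quoted above is just the 32-bit payload divided by the measured wall-clock time (a back-of-the-envelope calculation using the A770 numbers above, not a bandwidth benchmark):
payload_bits = 32                # one float32
elapsed_s = 0.1425               # measured clone + copy-to-host time on the A770
print(payload_bits / elapsed_s)  # ~224.56 bit/s effective rate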
The GPU-to-GPU measurement here is synchronized. If you want GPU performance data, please use the profiler tool to exclude the host-side runtime overhead, like:
with torch.autograd.profiler_legacy.profile() as prof:
    B = torch.clone(A)
print(prof.key_averages().table(sort_by="self_xpu_time_total"))
The tool will show you the host latency (kernel submission) and the asynchronous computation latency on the GPU.
I guess your build might not be an AOT (ahead-of-time) build, which adds runtime kernel JIT compilation overhead; for NVCC builds, AOT compilation is on by default. You may warm up the clone kernel first, like:
B = torch.clone(A)  # warm up
with torch.autograd.profiler_legacy.profile() as prof:
    B = torch.clone(A)
print(prof.key_averages().table(sort_by="self_xpu_time_total"))
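Putting the warm-up and synchronization together, here is a minimal sketch of a warmed-up wall-clock measurement on the XPU (names x and y follow the original script; this illustrates the suggestion above and is not a verified benchmark):
import time
import torch
import intel_extension_for_pytorch as ipex

x = torch.rand(1, 1, dtype=torch.float32, device='xpu')

y = x.clone()            # warm-up pass: triggers any JIT compilation of the copy kernel
torch.xpu.synchronize()  # make sure the warm-up work has finished on the device

start = time.time()
y = x.clone()            # the device-to-device copy being measured
torch.xpu.synchronize()  # wait for the copy to complete before stopping the clock
end = time.time()
print("Warmed-up clone time in seconds: %.6f" % (end - start))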