Printing a float32 takes 1340 µs in IPEX. This is fine.
However, transferring a single float32 number takes 0.142 s on an Intel Arc A770 16 GB. Why does this take so long? The effective GPU-to-GPU transfer rate works out to 224.56 bit/s for 1 float32.
For reference, an RTX 3090 takes 0.000359 s to transfer a single float32 number.
import time
import torch
import torchvision.models as models
import numpy as np
import intel_extension_for_pytorch as ipex
torch.manual_seed(0)
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
torch.xpu.synchronize()
start = time.time()
print(x.cpu())  # device-to-host copy plus print
end = time.time()
print("Print Time in Seconds: %.20f " % (end - start))
torch.manual_seed(2)
x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
y = torch.rand(1, 1, dtype=torch.float32, device='xpu')
torch.xpu.synchronize()
start = time.time()
y = x.clone()   # device-to-device copy on the XPU
print(y.cpu())  # device-to-host copy plus print
end = time.time()
print("Data Transfer Time in Seconds: %.20f " % (end - start))
PyTorch takes 0.142 s to issue 1 command on an Intel Arc A770 16 GB
tensor([[0.9179]])
Print Time in Seconds: 0.00134086608886718750
tensor([[0.9696]])
Data Transfer Time in Seconds: 0.14255475997924804688
PyTorch takes 0.000359 s to issue 1 command on an RTX 3090
tensor([[0.3990]])
Print Time in Seconds: 0.00103116035461425781
tensor([[0.4254]])
Data Transfer Time in Seconds: 0.00035905838012695312
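For context, the transfer-rate figure quoted above is just the 32-bit payload divided by the measured wall-clock time (a back-of-the-envelope calculation using the A770 numbers above, not a bandwidth benchmark):
payload_bits = 32                # one float32
elapsed_s = 0.1425               # measured clone + copy-to-host time on the A770
print(payload_bits / elapsed_s)  # ~224.56 bit/s effective rate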
The GPU-to-GPU measurement here is synchronized. If you want GPU performance data, please use the profiler tool to exclude the host-side runtime overhead, like:
with torch.autograd.profiler_legacy.profile() as prof:
    B = torch.clone(A)
print(prof.key_averages().table(sort_by="self_xpu_time_total"))
The tool will show you the host latency (kernel submission) and the asynchronous computation latency on the GPU.
I guess your build might not be an AOT (ahead-of-time) build, which adds runtime kernel JIT compilation overhead; for NVCC builds, AOT compilation is on by default. You may warm up the clone kernel first, like:
B = torch.clone(A)  # warm up
with torch.autograd.profiler_legacy.profile() as prof:
    B = torch.clone(A)
print(prof.key_averages().table(sort_by="self_xpu_time_total"))
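Putting the warm-up and synchronization together, here is a minimal sketch of a warmed-up wall-clock measurement on the XPU (names x and y follow the original script; this illustrates the suggestion above and is not a verified benchmark):
import time
import torch
import intel_extension_for_pytorch as ipex

x = torch.rand(1, 1, dtype=torch.float32, device='xpu')

y = x.clone()            # warm-up pass: triggers any JIT compilation of the copy kernel
torch.xpu.synchronize()  # make sure the warm-up work has finished on the device

start = time.time()
y = x.clone()            # the device-to-device copy being measured
torch.xpu.synchronize()  # wait for the copy to complete before stopping the clock
end = time.time()
print("Warmed-up clone time in seconds: %.6f" % (end - start))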