
Using cached queue instead of creating new one on type inference #946

Merged: 3 commits merged into IntelPython:main on Apr 21, 2023

Conversation

@AlexanderKalistratov (Contributor) commented Mar 2, 2023

Use the queue cached by dpctl for the USMNdArray type.

Potentially fixes #945
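
For readers skimming this thread, here is a minimal sketch of the idea (illustrative only, not the actual diff; the device selector string is an assumption, and dpctl.get_device_cached_queue is the dpctl helper mentioned later in this conversation):

import dpctl

device = dpctl.SyclDevice("gpu")  # illustrative device selection

# Before: type inference constructed a fresh queue for the device each time.
fresh_queue = dpctl.SyclQueue(device)

# After: reuse the per-device queue cached by dpctl.
cached_queue = dpctl.get_device_cached_queue(device)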

@fcharras commented Apr 5, 2023

TY for trying to solve that, @AlexanderKalistratov. If I understand correctly what is going on in this PR, then yes, I think this should work now that filter_string is being cached on the dpctl side (as long as the caching is done at the queue level).

Let me just point out that this PR fixes two of the three expensive calls I had found. The third is this one: https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/core/utils/suai_helper.py#L138. Fixing it looks simpler, since the data defined at that line doesn't seem to be used anywhere else (maybe it was added for debugging or made available for later work?), so it can just be replaced with device = syclobj.sycl_device.
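
Illustratively, the suggested one-line change would look something like this (a sketch based on the description above, not the actual suai_helper.py code; the surrounding setup is only there to make the snippet self-contained):

import dpnp

arr = dpnp.empty(10)
syclobj = arr.sycl_queue  # queue attached to a USM-backed array

# Expensive variant discussed in this thread: resolves the device filter string.
device_str = syclobj.sycl_device.filter_string

# Suggested replacement: keep the device object itself, no filter string lookup.
device = syclobj.sycl_device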

I'm giving up on trying to solve this with monkey patching; there's too much going on to get something clean enough that way. I'll just report performance with the patch from this PR and run some profiling.

@fcharras commented Apr 5, 2023

So I've run tests with this branch and the latest dpctl, and I can confirm that it's great progress toward reducing the numba_dpex overhead! The overhead is almost all gone. It would be very cool if this PR, or something like it, could make its way to a merge.

"Almost" is enough to make it useful again and not look obviously slow on the edge dev cloud, but there is still a visible overhead, and I think it will take a huge amount of data to make it negligible compared to the actual GPU compute. For small-ish loads it shows in the benchmarks.

For instance, we just merged a top-k, and a run has about 200 ms of Python overhead with numba_dpex and near-zero overhead with numpy. Benchmarks show that we can scan 500 million values more than ten times faster than numpy.partition does (6 s vs 0.5 s), but half of the time on GPU is overhead in numba_dpex. See the attached speedscope file (the regular patterns in the second half of the speedscope record are where the top-k calls actually happen):

speedscope_topk_runs.zip

(nb: unzip and load it in www.speedscope.app)
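
For reference, a minimal sketch of the numpy side of that comparison (the array size, dtype and value of k are assumptions, and the numba_dpex top-k itself is not reproduced here):

import time
import numpy as np

# 500 million float32 values (~2 GB); np.partition makes one more copy.
n, k = 500_000_000, 100
data = np.random.default_rng(0).random(n, dtype=np.float32)

start = time.time()
topk = np.partition(data, n - k)[n - k:]  # unordered top-k via partition
print(f"numpy.partition top-{k}: {time.time() - start:.2f} s")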

It would be awesome if, once a kernel is compiled, there were a user option that skips all input validation on the Python side and just submits the kernel to the queue via the shortest possible path.

To sum up, I think this PR is a very nice improvement, but could we keep another issue open to keep improving?

@fcharras commented Apr 6, 2023

It would be awesome if, once a kernel is compiled, there were a user option that skips all input validation

Thinking about it, the kernel specialization features that already exist, combined with #963 (and future work to expose it to the user), should already make it possible to skip the unpacking steps for each call.

@AlexanderKalistratov (Contributor, Author)

@diptorupd could you please review this PR?

Review comments on numba_dpex/core/types/usm_ndarray_type.py (resolved).
@fcharras

Just chiming in so that this additional call to .filter_string is not forgotten in this PR (could be removed) 🙏

@AlexanderKalistratov (Contributor, Author)

Just chiming in so that this additional call to .filter_string is not forgotten in this PR (could be removed) 🙏

@fcharras thanks for paying attention to the details and helping to investigate this performance overhead issue. Your inputs are very useful.
I've performed some tests and didn't find that this specific line produces significant overhead with the newer dpctl and dpcpp packages.
Moreover, I found that this PR is no longer needed to fix the performance overhead, as it has disappeared. Most likely everything was fixed by caching filter_string. This PR will be merged anyway, but more as a refactoring than as a performance optimization.
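
A quick way to sanity-check that on a given setup (a sketch; it assumes a default device can be selected):

import timeit
import dpctl

dev = dpctl.SyclDevice()  # default-selected device

# If dpctl caches filter_string, repeated accesses should take microseconds.
t = timeit.timeit(lambda: dev.filter_string, number=1000)
print(f"avg .filter_string access: {t / 1000 * 1e6:.1f} us")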

I've tried to measure the numba function and kernel calling overhead on my laptop, and here are the results:

import dpnp
import time
import numba_dpex

@numba_dpex.dpjit()
def foo(a, b=None, c=None):
    return 0

a = dpnp.empty(1)
for i in range(10):  # first iteration includes JIT compilation
    start = time.time()
    foo(a)
    end = time.time()
    print(1000*(end - start))  # per-call time in milliseconds

Output:

71.18582725524902
0.13828277587890625
0.05984306335449219
0.03719329833984375
0.02384185791015625
0.04100799560546875
0.023365020751953125
0.041484832763671875
0.023126602172851562
0.030517578125

So, the overhead of calling an (almost) empty numba function is 20-60 MICROseconds (after the second call).
This overhead, however, grows linearly as we pass more arrays as parameters:

foo(a, a, a)

Output:

74.2654800415039
0.1761913299560547
0.07319450378417969
0.07605552673339844
0.11277198791503906
0.08869171142578125
0.07390975952148438
0.08440017700195312
0.08058547973632812
0.11777877807617188

Now the overhead has increased to 70-120 MICROseconds.
This overhead consists almost entirely of numba type creation (USMNdArray).
It could be a problem for functions which take about 1-2 milliseconds and accept a lot of dpctl arrays as input (e.g. something like a DL convolution). But at the moment it is not the biggest issue.
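
A compact way to see that growth directly (a sketch reusing the same dpjit setup as above, but averaging over repeated calls instead of printing single timings):

import time
import dpnp
import numba_dpex

@numba_dpex.dpjit()
def foo(a, b=None, c=None):
    return 0

a = dpnp.empty(1)
for args in [(a,), (a, a), (a, a, a)]:
    foo(*args)  # warm-up call triggers compilation for this arity
    start = time.time()
    for _ in range(100):
        foo(*args)
    avg_us = (time.time() - start) * 1e6 / 100
    print(f"{len(args)} array arg(s): {avg_us:.1f} us per call")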

I've also tried a numba-dpex @kernel, and the results are a bit worse:

import dpnp
import time
import numba_dpex
from numba_dpex import Range, NdRange

@numba_dpex.kernel
def bar(a):
    a[0] = 0

a = dpnp.empty(1)
for i in range(10):  # first iteration includes JIT compilation
    start = time.time()
    bar[Range(1)](a)
    end = time.time()
    print(1000*(end - start))  # per-call time in milliseconds

Outputs:

170.96805572509766
0.9398460388183594
0.7669925689697266
0.7381439208984375
0.6814002990722656
0.7398128509521484
0.5812644958496094
0.6299018859863281
0.7305145263671875
0.7123947143554688

So, it is 0.6-0.8 milliseconds of overhead. This overhead also grows with the number of input arrays:

bar[Range(1)](a, a, a)

Outputs:

180.30858039855957
1.474618911743164
0.8640289306640625
1.0759830474853516
0.9913444519042969
1.0266304016113281
0.8571147918701172
0.9262561798095703
0.7903575897216797
1.0187625885009766

Now it is 0.8-1.1 ms.
It is unpleasant and should be investigated. This could be noticeable for small kernels, but I'm not sure it would significantly affect a kernel which runs for 0.5 seconds.

So if you see an overhead of 200 ms, could you please provide steps to reproduce the issue, so I can investigate it?
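
One way to get a first breakdown of where that per-call time goes (a sketch reusing the bar/Range setup above; the exact profile output will of course vary):

import cProfile
import pstats

import dpnp
import numba_dpex
from numba_dpex import Range

@numba_dpex.kernel
def bar(a):
    a[0] = 0

a = dpnp.empty(1)
bar[Range(1)](a)  # warm up so compilation is excluded from the profile

cProfile.run("for _ in range(100): bar[Range(1)](a)", "kernel_overhead.prof")
pstats.Stats("kernel_overhead.prof").sort_stats("cumulative").print_stats(15)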

@diptorupd (Contributor) left a comment

Thank you! I will merge as soon as dpctl.get_device_cached_queue is available on our internal CI servers.

@oleksandr-pavlyk (Contributor)

I have just tagged 0.14.3dev1 and merged that to gold/2021

@fcharras commented Apr 19, 2023

TY for the more rigorous benchmarking @diptorupd

If the latest fixes in dpctl already removed all the overhead, that's super cool.

About the ~200 ms overhead I had (very roughly) estimated: I was not referring to a single kernel call's overhead, but to an aggregated overhead over up to 15 iterations of 4 dpex.kernel calls with respectively 1, 1, 7 and 9 input arrays, which are scripted together to implement a partition reduction primitive. I feel the ms-magnitude overhead of a single kernel call can definitely add up to significant numbers (relative to the total execution time) in end-of-line functions 🤔, where you typically have one main kernel with most of the compute and several support kernels that perform simple ops like summing over an axis or initializing values.

I agree that it might still be negligible in downstream applications; I will report back if I see actual issues in real-world downstream use.
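
As a rough cross-check of that aggregate, here is a back-of-the-envelope estimate using the per-call timings reported earlier in this thread (an illustrative calculation, not an exact accounting):

# ~0.7 ms base per kernel call, plus roughly 0.1 ms per additional array
# argument, taken from the measurements posted above.
base_ms, per_array_ms = 0.7, 0.1
arrays_per_call = [1, 1, 7, 9]  # the four kernels in the partition primitive
iterations = 15

per_iteration = sum(base_ms + per_array_ms * (n - 1) for n in arrays_per_call)
print(f"~{per_iteration * iterations:.0f} ms of Python-side overhead per run")
# Prints ~63 ms: the same order of magnitude as the rough ~200 ms estimate.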

@fcharras commented Apr 19, 2023

Also, some environments are more likely to show overhead, like the dev cloud. If filter_string is not properly cached in syclobj.sycl_device.filter_string, it might not show on a bare-metal environment but will show on the Intel dev cloud.

@chudur-budur self-assigned this Apr 20, 2023
@diptorupd force-pushed the use_of_cached_queue branch from 05f14f7 to 71e605a on April 21, 2023 at 19:03
@diptorupd merged commit 0915170 into IntelPython:main on Apr 21, 2023
github-actions bot added a commit that referenced this pull request on Apr 21, 2023: "Using cached queue instead of creating new one on type inference" (0915170)
@fcharras

FYI, I re-ran the benchmarks from the main branch after this merge, along with dpctl==0.14.3dev1, and can confirm that performance is OK.

Successfully merging this pull request may close these issues:

Huge overhead on devcloud linked to dpctl calls (#945)