
Using cached queue instead of creating new one on type inference #946

Merged: 3 commits merged into IntelPython:main on Apr 21, 2023

Conversation

@AlexanderKalistratov (Contributor) commented Mar 2, 2023

Use the queue cached by dpctl for the USMNdArray type.

Potentially fixes #945
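
For readers skimming this thread, here is a minimal sketch of the idea (illustrative only, not the actual diff; the device selector string is an assumption, and dpctl.get_device_cached_queue is the dpctl helper mentioned later in this conversation):

import dpctl

device = dpctl.SyclDevice("gpu")  # illustrative device selection

# Before: type inference constructed a fresh queue for the device each time.
fresh_queue = dpctl.SyclQueue(device)

# After: reuse the per-device queue cached by dpctl.
cached_queue = dpctl.get_device_cached_queue(device)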

@fcharras commented Apr 5, 2023

TY for trying to solve that, @AlexanderKalistratov. If I understand correctly what is going on in this PR, then yes, I think this should work now that filter_string is being cached on the dpctl side (as long as the caching is done at the queue level).

Let me just point out that this PR fixes two of the three expensive calls I had found. The third is this one: https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/core/utils/suai_helper.py#L138. Fixing it looks simpler, since the data defined at that line doesn't seem to be used anywhere else (maybe it was added for debugging or made available for later work?), so it can just be replaced with device = syclobj.sycl_device.
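
Illustratively, the suggested one-line change would look something like this (a sketch based on the description above, not the actual suai_helper.py code; the surrounding setup is only there to make the snippet self-contained):

import dpnp

arr = dpnp.empty(10)
syclobj = arr.sycl_queue  # queue attached to a USM-backed array

# Expensive variant discussed in this thread: resolves the device filter string.
device_str = syclobj.sycl_device.filter_string

# Suggested replacement: keep the device object itself, no filter string lookup.
device = syclobj.sycl_device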

I'm giving up on trying to solve this with monkey patching; there's too much going on to get something clean enough that way. I'll just report performance with the patch from this PR and run some profiling.

@fcharras commented Apr 5, 2023

So I've run tests with this branch and the latest dpctl, and I can confirm that it's great progress toward reducing the numba_dpex overhead! The overhead is almost all gone. It would be very cool if this PR, or something like it, could make its way to a merge.

"Almost" is enough to make it useful again and not look obviously slow on the edge dev cloud, but there is still a visible overhead, and I think it will take a huge amount of data to make it negligible compared to the actual GPU compute. For small-ish loads it shows in the benchmarks.

For instance, we just merged a top-k, and a run has about 200 ms of Python overhead with numba_dpex and near-zero overhead with numpy. Benchmarks show that we can scan 500 million values more than ten times faster than numpy.partition does (6 s vs 0.5 s), but half of the time on GPU is overhead in numba_dpex. See the attached speedscope file (the regular patterns in the second half of the speedscope record are where the top-k calls actually happen):

speedscope_topk_runs.zip

(nb: unzip and load it in www.speedscope.app)
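
For reference, a minimal sketch of the numpy side of that comparison (the array size, dtype and value of k are assumptions, and the numba_dpex top-k itself is not reproduced here):

import time
import numpy as np

# 500 million float32 values (~2 GB); np.partition makes one more copy.
n, k = 500_000_000, 100
data = np.random.default_rng(0).random(n, dtype=np.float32)

start = time.time()
topk = np.partition(data, n - k)[n - k:]  # unordered top-k via partition
print(f"numpy.partition top-{k}: {time.time() - start:.2f} s")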

It would be awesome if, once a kernel is compiled, there were a user option that skips all input validation on the Python side and just submits the kernel to the queue via the shortest possible path.

To sum up, I think this PR is a very nice improvement, but could we keep another issue open to keep improving?

@fcharras commented Apr 6, 2023

It would be awesome if, once a kernel is compiled, there were a user option that skips all input validation

Thinking about it, the kernel specialization features that already exist, combined with #963 (and future work to expose it to the user), should already make it possible to skip the unpacking steps for each call.

@AlexanderKalistratov (Contributor, Author)

@diptorupd could you please review this PR?

Review comments on numba_dpex/core/types/usm_ndarray_type.py (resolved).
@fcharras

Just chiming in so that this additional call to .filter_string is not forgotten in this PR (could be removed) 🙏

@AlexanderKalistratov (Contributor, Author)

Just chiming in so that this additional call to .filter_string is not forgotten in this PR (could be removed) 🙏

@fcharras thanks for paying attention to the details and helping to investigate this performance overhead issue. Your inputs are very useful.
I've performed some tests and didn't find that this specific line produces significant overhead with the newer dpctl and dpcpp packages.
Moreover, I found that this PR is no longer needed to fix the performance overhead, as it has disappeared. Most likely everything was fixed by caching filter_string. This PR will be merged anyway, but more as a refactoring than as a performance optimization.
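
A quick way to sanity-check that on a given setup (a sketch; it assumes a default device can be selected):

import timeit
import dpctl

dev = dpctl.SyclDevice()  # default-selected device

# If dpctl caches filter_string, repeated accesses should take microseconds.
t = timeit.timeit(lambda: dev.filter_string, number=1000)
print(f"avg .filter_string access: {t / 1000 * 1e6:.1f} us")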

I've tried to measure the numba function and kernel calling overhead on my laptop, and here are the results:

import dpnp
import time
import numba_dpex

@numba_dpex.dpjit()
def foo(a, b=None, c=None):
    return 0

a = dpnp.empty(1)
for i in range(10):  # first iteration includes JIT compilation
    start = time.time()
    foo(a)
    end = time.time()
    print(1000*(end - start))  # per-call time in milliseconds

Output:

71.18582725524902
0.13828277587890625
0.05984306335449219
0.03719329833984375
0.02384185791015625
0.04100799560546875
0.023365020751953125
0.041484832763671875
0.023126602172851562
0.030517578125

So, the overhead of calling an (almost) empty numba function is 20-60 MICROseconds (after the second call).
This overhead, however, grows linearly as we pass more arrays as parameters:

foo(a, a, a)

Output:

74.2654800415039
0.1761913299560547
0.07319450378417969
0.07605552673339844
0.11277198791503906
0.08869171142578125
0.07390975952148438
0.08440017700195312
0.08058547973632812
0.11777877807617188

Now the overhead has increased to 70-120 MICROseconds.
This overhead consists almost entirely of numba type creation (USMNdArray).
It could be a problem for functions which take about 1-2 milliseconds and accept a lot of dpctl arrays as input (e.g. something like a DL convolution). But at the moment it is not the biggest issue.
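
A compact way to see that growth directly (a sketch reusing the same dpjit setup as above, but averaging over repeated calls instead of printing single timings):

import time
import dpnp
import numba_dpex

@numba_dpex.dpjit()
def foo(a, b=None, c=None):
    return 0

a = dpnp.empty(1)
for args in [(a,), (a, a), (a, a, a)]:
    foo(*args)  # warm-up call triggers compilation for this arity
    start = time.time()
    for _ in range(100):
        foo(*args)
    avg_us = (time.time() - start) * 1e6 / 100
    print(f"{len(args)} array arg(s): {avg_us:.1f} us per call")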

I've also tried a numba-dpex @kernel, and the results are a bit worse:

import dpnp
import time
import numba_dpex
from numba_dpex import Range, NdRange

@numba_dpex.kernel
def bar(a):
    a[0] = 0

a = dpnp.empty(1)
for i in range(10):  # first iteration includes JIT compilation
    start = time.time()
    bar[Range(1)](a)
    end = time.time()
    print(1000*(end - start))  # per-call time in milliseconds

Outputs:

170.96805572509766
0.9398460388183594
0.7669925689697266
0.7381439208984375
0.6814002990722656
0.7398128509521484
0.5812644958496094
0.6299018859863281
0.7305145263671875
0.7123947143554688

So, it is 0.6-0.8 milliseconds of overhead. This overhead also grows with the number of input arrays:

bar[Range(1)](a, a, a)

Outputs:

180.30858039855957
1.474618911743164
0.8640289306640625
1.0759830474853516
0.9913444519042969
1.0266304016113281
0.8571147918701172
0.9262561798095703
0.7903575897216797
1.0187625885009766

Now it is 0.8-1.1 ms.
It is unpleasant and should be investigated. This could be noticeable for small kernels, but I'm not sure it would significantly affect a kernel which runs for 0.5 seconds.

So if you see an overhead of 200 ms, could you please provide steps to reproduce the issue, so I can investigate it?
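
One way to get a first breakdown of where that per-call time goes (a sketch reusing the bar/Range setup above; the exact profile output will of course vary):

import cProfile
import pstats

import dpnp
import numba_dpex
from numba_dpex import Range

@numba_dpex.kernel
def bar(a):
    a[0] = 0

a = dpnp.empty(1)
bar[Range(1)](a)  # warm up so compilation is excluded from the profile

cProfile.run("for _ in range(100): bar[Range(1)](a)", "kernel_overhead.prof")
pstats.Stats("kernel_overhead.prof").sort_stats("cumulative").print_stats(15)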

@diptorupd (Contributor) left a comment

Thank you! I will merge as soon as dpctl.get_device_cached_queue is available on our internal CI servers.

@oleksandr-pavlyk (Contributor)

I have just tagged 0.14.3dev1 and merged that to gold/2021

@fcharras commented Apr 19, 2023

TY for the more rigorous benchmarking @diptorupd

If the latest fixes in dpctl already removed all the overhead, that's super cool.

About the ~200 ms overhead I had (very roughly) estimated: I was not referring to a single kernel call's overhead, but to an aggregated overhead over up to 15 iterations of 4 dpex.kernel calls with respectively 1, 1, 7 and 9 input arrays, which are scripted together to implement a partition reduction primitive. I feel the ms-magnitude overhead of a single kernel call can definitely add up to significant numbers (relative to the total execution time) in end-of-line functions 🤔, where you typically have one main kernel with most of the compute and several support kernels that perform simple ops like summing over an axis or initializing values.

I agree that it might still be negligible in downstream applications; I will report back if I see actual issues in real-world downstream use.
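
As a rough cross-check of that aggregate, here is a back-of-the-envelope estimate using the per-call timings reported earlier in this thread (an illustrative calculation, not an exact accounting):

# ~0.7 ms base per kernel call, plus roughly 0.1 ms per additional array
# argument, taken from the measurements posted above.
base_ms, per_array_ms = 0.7, 0.1
arrays_per_call = [1, 1, 7, 9]  # the four kernels in the partition primitive
iterations = 15

per_iteration = sum(base_ms + per_array_ms * (n - 1) for n in arrays_per_call)
print(f"~{per_iteration * iterations:.0f} ms of Python-side overhead per run")
# Prints ~63 ms: the same order of magnitude as the rough ~200 ms estimate.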

@fcharras commented Apr 19, 2023

Also, some environments are more likely to show overhead, like the dev cloud. If filter_string is not properly cached in syclobj.sycl_device.filter_string, it might not show on a bare-metal environment but will show on the Intel dev cloud.

@chudur-budur self-assigned this Apr 20, 2023
@diptorupd force-pushed the use_of_cached_queue branch from 05f14f7 to 71e605a on April 21, 2023 at 19:03
@diptorupd merged commit 0915170 into IntelPython:main on Apr 21, 2023
github-actions bot added a commit that referenced this pull request on Apr 21, 2023: "Using cached queue instead of creating new one on type inference" (0915170)
@fcharras

FYI, I re-ran the benchmarks from the main branch after this merge, along with dpctl==0.14.3dev1, and can confirm that performance is OK.

Successfully merging this pull request may close these issues:

Huge overhead on devcloud linked to dpctl calls (#945)