Using cached queue instead of creating new one on type inference #946
Conversation
Branch force-pushed from 8f23f93 to 93c0418.
Branch force-pushed from 56bac08 to dc0529a.
Branch force-pushed from dc0529a to 9d8ae9e.
TY for trying to solve that @AlexanderKalistratov. If I understand correctly what is going on in this PR, then yes, I think this should work.

Let me just point out that this PR fixes two of the three expensive calls that I had found. The third is this one: https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/core/utils/suai_helper.py#L138, but fixing it looks simpler, since the data defined at this line doesn't seem to be actually used anywhere else (maybe it was added for debug purposes or made available for later work?), so it can just be replaced with …

I'm giving up trying to solve this with monkey patching; there's too much going on to get something clean enough this way. I'm just going to report performance with the patch from this PR and run some profiling.
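For illustration, here is a minimal sketch of the queue-caching pattern this PR and the comment above are about, assuming a hypothetical helper `_cached_queue_for_device` (not part of numba-dpex) and an example "gpu" filter string:

```python
# Minimal sketch (the helper name and filter string are illustrative assumptions).
# Cache one SyclQueue per device filter string so that repeated type-inference
# calls do not pay the cost of constructing a fresh queue every time.
import functools

import dpctl


@functools.lru_cache(maxsize=None)
def _cached_queue_for_device(filter_string):
    # Queue construction is the expensive step; do it once per device.
    return dpctl.SyclQueue(filter_string)


# Repeated lookups for the same device reuse the cached queue object.
q1 = _cached_queue_for_device("gpu")
q2 = _cached_queue_for_device("gpu")
assert q1 is q2
```

Later in the thread, dpctl's own `get_device_cached_queue` is mentioned as the facility the merge waits on; the `lru_cache` version above only illustrates the pattern.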
So I've run tests with this branch and the latest …, and it is almost enough to make it useful again and not look obviously slow on the edge dev cloud, but there's still a visible overhead, and I think it will require a huge amount of data to make it negligible compared to the actual GPU compute. For small-ish loads it shows in the benchmark. For instance, we just merged a top-k, and a run has about 200ms of Python overhead (nb: unzip and load it in …).

It would be awesome if, once a kernel is compiled, there were a user option that enables skipping all input validation on the Python side and just submits the kernel to the queue with the shortest possible path.

To sum up, I think this PR is a very nice improvement, but could we keep another issue open to keep improving?
Thinking about it, kernel specialization features that already exist, combined with #963 (and future work to expose it to the user), should enable skipping unpacking steps for each call already.
@diptorupd could you please review this PR?
Just chiming in so that this additional call to …
@fcharras thanks for paying attention to the details and helping to investigate this performance overhead issue. Your inputs are very useful. I've tried to measure numba function and kernel calling overhead on my laptop, and here are the results:

```python
import dpnp
import time
import numba_dpex


@numba_dpex.dpjit()
def foo(a, b=None, c=None):
    return 0


a = dpnp.empty(1)

for i in range(10):
    start = time.time()
    foo(a)
    end = time.time()
    print(1000*(end - start))
```

Output:

So, the overhead of calling an (almost) empty numba function is 20-60 MICROseconds (after the second call). With three input arrays:

```python
foo(a, a, a)
```

Output:

Now the overhead increased up to 70-120 MICROseconds.

I've also tried a kernel:

```python
import dpnp
import time
import numba_dpex
from numba_dpex import Range, NdRange


@numba_dpex.kernel
def bar(a):
    a[0] = 0


a = dpnp.empty(1)

for i in range(10):
    start = time.time()
    bar[Range(1)](a)
    end = time.time()
    print(1000*(end - start))
```

Outputs:

So, it is 0.6-0.8 milliseconds of overhead. This overhead also grows with the number of input arrays:

```python
bar[Range(1)](a, a, a)
```

Outputs:

Now it is 0.8-1.1 ms. So if you see an overhead of 200ms, could you please provide steps to reproduce the issue, so I can investigate it?
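For reference, a sketch of the same measurement using `time.perf_counter()`, which has finer resolution than `time.time()` at the microsecond scale (it assumes the same dpnp/numba_dpex environment as the snippets above):

```python
# Sketch: same overhead measurement as above, but with time.perf_counter()
# and an explicit warm-up call so compilation time is excluded.
import time

import dpnp
import numba_dpex


@numba_dpex.dpjit()
def foo(a, b=None, c=None):
    return 0


a = dpnp.empty(1)
foo(a)  # warm-up: the first call includes compilation

timings_ms = []
for _ in range(10):
    start = time.perf_counter()
    foo(a)
    timings_ms.append(1000 * (time.perf_counter() - start))

print("min:", min(timings_ms), "mean:", sum(timings_ms) / len(timings_ms))
```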
Branch force-pushed from 1d14e97 to e7ea064.
Thank you! I will merge as soon as `dpctl.get_device_cached_queue` is available on our internal CI servers.
I have just tagged 0.14.3dev1 and merged that to gold/2021.
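To make that dependency concrete, here is a sketch of what calling the dpctl facility mentioned above could look like; the exact import location and accepted argument types of `get_device_cached_queue` are assumptions and may differ between dpctl versions:

```python
# Sketch (assumption: the import path of get_device_cached_queue may vary).
import dpctl

try:
    from dpctl import get_device_cached_queue
except ImportError:
    # Possible alternative location in some dpctl versions (assumption).
    from dpctl._sycl_queue_manager import get_device_cached_queue

dev = dpctl.SyclDevice()            # default device
q1 = get_device_cached_queue(dev)   # queue cached per device
q2 = get_device_cached_queue(dev)
assert q1 == q2                     # lookups for the same device reuse the cache
```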
TY for the more rigorous benchmarking @diptorupd. If the latest fixes on …

About the overhead of 200ms I've (very roughly) estimated: I was not referring to a single kernel call overhead, but to an aggregated overhead over up to 15 iterations of 4 …

I agree that it might still be negligible in downstream applications; I will report if I see actual issues in real-world downstream use.
Also, some environments are more likely to show overhead, like dev cloud. If …
Branch force-pushed from 05f14f7 to 71e605a.
Branch force-pushed from 71e605a to c81f816.
Using cached queue instead of creating new one on type inference 0915170
FYI I re-ran benchmarks from the main branch after this merge, along with …
Using queue cached by dpctl for USMNdArray type.

Potentially fixes #945