Huge overhead on devcloud linked to dpctl calls #945
Comments
To avoid confusion, the …
The relevant calls to investigate here are the cells closer to the bottom: since each is as large as its parent cell, it is the bottleneck. By hovering over those cells you can see the filename and the line number. You should be able to trace them back to the 3 lines I've linked in the OP.
Workaround here.
Unfortunately, it doesn't fix it. Looking at the PR, it doesn't seem to change the instructions that lead to the time-consuming steps in the OP (that are, …
The workaround I posted yesterday doesn't work either. (Currently fixing.)
The construction of SYCL devices may thus be expensive, as the RT must talk to the hardware. This suggests that using …
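(Not part of the thread: a minimal timing sketch, assuming `dpctl` is installed and a default SYCL device is available, to observe the construction cost this comment describes.)

```python
# Sketch only: compare repeatedly constructing a SyclDevice, which
# forces the SYCL runtime to talk to the hardware, against reusing
# one already-constructed instance.
import timeit

import dpctl

# Fresh construction on every call is the expensive path.
t_construct = timeit.timeit(lambda: dpctl.SyclDevice(), number=100)

# Reusing a single instance avoids the repeated runtime round-trips.
dev = dpctl.SyclDevice()
t_reuse = timeit.timeit(lambda: dev.name, number=100)

print(f"construct x100: {t_construct:.4f}s  reuse x100: {t_reuse:.4f}s")
```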
You are correct. We should extract the device from …
I would argue that the need to store …
@oleksandr-pavlyk …
I've fixed the monkey-patching workaround given in a previous comment. This should work: https://github.com/soda-inria/sklearn-numba-dpex/blob/e040e78d2a5492d7b7b0ec79c2576f0df15cb9db/sklearn_numba_dpex/patches/load_numba_dpex.py#L44 (edit: it seems to work. I'd argue that the draft caching mechanism outlined in this hack might have some value for …
This also (almost?) entirely fixes the remaining small overhead that we noticed even on laptop iGPUs after the caching overhaul (pointed out in #886 (comment)). So this issue is exacerbated on the Intel edge devcloud, but also noticeable on more ordinary hardware.
Absolutely (#930). Using the filter string for compute-follows-data and having it be part of any type signature (DpnpNdArray or SyclQueue) is a no-go. I only did it as a stopgap under time pressure.
Sure, but that has nothing to do with adding it to any type signature. Moreover, it is conceivable that advanced programmers will target sub-devices and want much finer-grained control. For such cases, a filter string is not supported by SYCL.
I agree, but given the performance overhead of generating a filter string it is not possible. We can perhaps add backend and device type as string attributes for ease of reading typemaps and such. It is the generation of the device number that kills performance.
It has. Numba caches compiled functions based on input types. Types are described by signatures, and types with equal signatures are considered equal. Not having the device in the type signature means Numba wouldn't know for which device the function should be compiled.
I really don't see any problem with caching the filter string for the device. You need to generate it only once per created device. In Python (not sure about Cython) it is a single-line fix (sketched below).
OK. That means we would need another human-friendly text representation of SYCL devices/sub-devices. But I really don't think that numba-dpex should be responsible for this.
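A minimal sketch of the single-line caching fix suggested above, assuming `SyclDevice` instances are hashable so `functools.lru_cache` can key on them; `cached_filter_string` is an illustrative helper, not dpctl API:

```python
# Illustrative sketch: memoize the expensive filter-string
# computation so it runs at most once per device instance.
import functools

import dpctl


@functools.lru_cache(maxsize=None)
def cached_filter_string(device: dpctl.SyclDevice) -> str:
    # Generating the filter string enumerates devices to find the
    # device number, which is the costly step discussed above.
    return device.filter_string


dev = dpctl.SyclDevice()
print(cached_filter_string(dev))  # pays the cost once
print(cached_filter_string(dev))  # served from the cache
```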
FYI, I have added caching for …
@oleksandr-pavlyk this is half of the fix for this issue, I think? The remaining issue is that, since the cache key is a device instance, the cache is not shared across distinct arrays or queues. Would it be possible for all arrays to share the same device instance (i.e. having …
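A sketch of the shared-device-instance idea from this comment: funnel device construction through a cache keyed on the filter string, so distinct arrays and queues resolve to the very same `SyclDevice` object and identity-keyed caches actually hit. `canonical_device` and the filter string used are illustrative assumptions, not existing API:

```python
import functools

import dpctl


@functools.lru_cache(maxsize=None)
def canonical_device(filter_string: str) -> dpctl.SyclDevice:
    # All callers asking for the same filter string get the same
    # SyclDevice object, so per-instance caches are shared.
    return dpctl.SyclDevice(filter_string)


# Assumes a Level Zero GPU is present on this machine.
d1 = canonical_device("level_zero:gpu:0")
d2 = canonical_device("level_zero:gpu:0")
assert d1 is d2  # one shared instance, one cache entry
```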
@fcharras Could you please try #946 again? I've updated it according to your comment, and I think with IntelPython/dpctl#1127 it should solve the problem. I'm not sure if IntelPython/dpctl#1127 is already on …
I'll look more into that today and get back to you.
Version: numba_0.20.0dev3 and main

The three following dpctl calls (1, 2, 3) have huge wall time on the edge devcloud (measured ranging from 10 to 30 ms per call by py-spy; see the speedscope report).

On the devcloud this adds about 80 seconds to the k-means benchmark (for an expected 10 seconds). I didn't see the issue on a local machine, but maybe the remaining small overhead that we reported comes from there.

@oleksandr-pavlyk not sure if this should be considered an unreasonable use in `numba_dpex` (should those calls be expected to be that long, and cached?) or a bug in `dpctl`.

I've been experimenting with caching the values and can confirm that caching those 3 calls completely removes the overhead.

Regarding the scope of the cache, I'll check if a hotfix that consists in storing those values in a `WeakKeyDictionary` whose keys are `val` and `usm_mem`, and wrapping the `SyclDevice(device)` call in an `lru_cache`, is enough. (If so, I will monkey-patch it in `sklearn_numba_dpex` in the meantime.)
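A minimal sketch of the hotfix outlined above, assuming the objects used as keys (`val`, `usm_mem`) are weak-referenceable, and using `usm_mem.sycl_queue.sycl_device` as a stand-in for whichever expensive lookup is being cached; `device_of` and `make_sycl_device` are hypothetical helpers, not the actual monkey-patch:

```python
import functools
import weakref

import dpctl

# Per-object cache: entries disappear automatically when the USM
# allocation used as the key is garbage-collected.
_device_by_usm_mem = weakref.WeakKeyDictionary()


def device_of(usm_mem):
    # Hypothetical helper: resolve the device backing a USM
    # allocation at most once per allocation.
    try:
        return _device_by_usm_mem[usm_mem]
    except KeyError:
        dev = usm_mem.sycl_queue.sycl_device  # assumed stand-in lookup
        _device_by_usm_mem[usm_mem] = dev
        return dev


@functools.lru_cache(maxsize=None)
def make_sycl_device(device):
    # Wraps the SyclDevice(device) call from the OP so the SYCL
    # runtime is queried only once per distinct argument.
    return dpctl.SyclDevice(device)
```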