Dispatcher/caching rewrite to address performance regression #912
Thanks @adarshyoga for triaging it. Can you try making the func_hash optional based on a flag? That can save us some cycles.
)
if not key:
    stripped_argtypes = self._strip_usm_metadata(argtypes)
    codegen_magic_tuple = kernel.target_context.codegen().magic_tuple()
I compared with 0.19.0; even this extra hashing on the codegen_magic_tuple may be adding overhead. Can you try removing this and just adding the kernel to the key, like we had in 0.19?
…aching.build_key(). This avoids computing the hash on every call. (2) Moved the argtypes list-building logic to func.py and dispatcher. Again, this avoids building the list on every call. (3) Rewrote build_key to take variable args and return a tuple. (4) Removed the unnecessary call to LRUCache.get() inside LRUCache.put().
…ng functions to a separate cache utils. (2) Added docstrings. (3) Replaced get() with explicit logic to update the list in LRUCache.
66d3a82 to 1056a58
LGTM! I manually restarted those stuck jobs on TeamCity and will merge as soon as those CI jobs pass.
Dispatcher/caching rewrite to address performance regression ae994cd
This PR partially addresses the performance regressions described in #886. It contains the following 4 key changes.
(1) The put() implementation in LRUCache contained an unnecessary call to get(). This PR replaces that call with explicit logic that updates the linked list tracking LRU ordering.
(2) A SHA-256 hash was computed on every call that builds a cache key, and a key is built on every dynamic call to a kernel. The hash only needs to be computed once, so this PR computes it once per static instance of a kernel rather than once per dynamic call.
(3) The types of the kernel arguments are used as part of the key when caching. The argument types were pre-processed to strip out USM metadata, and this was done twice for each call: once when building the key for the kernel module cache and again for the kernel bundle cache. This PR changes the logic to perform the pre-processing once per call.
(4) The function that builds the cache key now takes a variable number of arguments and returns a tuple. The rest of the logic from build_key has been moved to dispatcher and func. The intuition behind using variable args is to support the different caches (so far, the kernel module cache and the kernel bundle cache), which use a different number of key components. (Side note: the build_key function is ideally suited to exist as a static method inside the AbstractCache class.)
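The ideas behind changes (1), (2), and (4) can be illustrated with a minimal, self-contained sketch. This is not the numba-dpex implementation: the `Kernel` class, its `source_hash` attribute, and the use of `OrderedDict` in place of an explicit linked list are assumptions made for illustration only.

```python
import hashlib
from collections import OrderedDict


class LRUCache:
    """Illustrative LRU cache; an OrderedDict stands in for the linked list."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        # Update the LRU ordering directly instead of calling get() first.
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        self._data[key] = value


def build_key(*args):
    """Return the key components as a hashable tuple."""
    return tuple(args)


class Kernel:
    """Hypothetical kernel stand-in that hashes its source once, at creation."""

    def __init__(self, source):
        self.source_hash = hashlib.sha256(source.encode()).hexdigest()


cache = LRUCache(capacity=2)
kernel = Kernel("def f(a, b): ...")
argtypes = ("float32[:]", "int64")  # stand-in for argument types, stripped once

# Each call reuses the precomputed hash; no SHA-256 work happens here.
key = build_key(kernel.source_hash, argtypes)
cache.put(key, "compiled-module")
```

Because `build_key` accepts any number of components, the same helper can serve caches whose keys differ in length, which is the motivation given in (4).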
Effects of optimizations:
The effect of these changes was evaluated using the kmeans implementation from here.
On Intel ATS GPUs, the execution time before these changes is 1.7 seconds; after them it drops to 1.4 seconds (see the logs below). With numba-dpex 0.19.0 the execution time is 1.1 seconds, so these changes recover about half of the 0.6-second regression introduced after 0.19.0.
Run Log with numba-dpex 0.19.0:
python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.1 s
Run Log with numba-dpex main:
python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.7 s
Run Log with this PR:
python benchmark/kmeans.py
Running Kmeans numba_dpex lloyd GPU ... done in 1.4 s
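As a rough check on the numbers above, the short sketch below parses the three log lines and computes how much of the regression this PR recovers. The `seconds` helper and the dictionary labels are introduced here for illustration, and each log reflects a single run, so the result is only indicative.

```python
import re

# Log lines copied from the runs above.
logs = {
    "0.19.0": "Running Kmeans numba_dpex lloyd GPU ... done in 1.1 s",
    "main": "Running Kmeans numba_dpex lloyd GPU ... done in 1.7 s",
    "this PR": "Running Kmeans numba_dpex lloyd GPU ... done in 1.4 s",
}


def seconds(line):
    """Extract the elapsed time in seconds from a benchmark log line."""
    match = re.search(r"done in ([\d.]+) s", line)
    if match is None:
        raise ValueError(f"no timing found in: {line!r}")
    return float(match.group(1))


times = {name: seconds(line) for name, line in logs.items()}

regression = times["main"] - times["0.19.0"]  # slowdown introduced after 0.19.0
recovered = times["main"] - times["this PR"]  # time won back by this PR
fraction = recovered / regression

print(f"regression: {regression:.1f} s, recovered: {recovered:.1f} s ({fraction:.0%})")
```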