
On making task configuration and task args available to the user without executing #769

Closed
fcharras opened this issue Aug 29, 2022 · 11 comments

@fcharras

It seems the underlying dpctl API supports dispatching a SyclKernel asynchronously, and numba_dpex.kernel is a few lines of code away from exposing to the user tasks that can be fed to SyclQueues and mapped to SyclEvents that embed a DAG of execution dependencies.

This small test script shows how it works; the async_iter function here is five times faster than its sync counterpart.

Would you say that this is a correct take, and that it could be interesting to use? Would there be a case for exposing corresponding public methods?

@fcharras
Author

@oleksandr-pavlyk maybe you have insight on this?

@ogrisel

ogrisel commented Sep 8, 2022

As discussed IRL, this is related to the problem described in:

Based on my understanding, we need to make sure that queue.wait() is called before Python exits. We could achieve this with a try / finally in the user's code. Alternatively, we could use the Python atexit module to register a callback whenever we use the dpctl queue directly.
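For illustration, a minimal sketch of the atexit approach, assuming the dpctl queue is used directly (only dpctl.SyclQueue and its wait method, both mentioned in this thread, are used):

import atexit

import dpctl

queue = dpctl.SyclQueue()

# Safeguard: make sure every task submitted to `queue` has completed
# before the Python interpreter shuts down.
atexit.register(queue.wait)

# ... submit kernels to `queue` asynchronously; on normal interpreter
# exit, the registered callback calls queue.wait() ...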

@fcharras
Author

fcharras commented Sep 21, 2022

If I understand correctly the relevant paragraph is:

A user has to make sure that all tasks, including the host tasks, complete before exiting the script. To do that the user can explicitly wait on the returned events, or wait directly on the used execution queues.

If all it takes to avoid issues is ensuring a queue.wait(), is there really a limitation here? It doesn't sound unreasonable to expect users to wait() if it's well documented, and to expect chaotic behavior if they don't. And (apart from brutal terminations like segfaults and SIGTERMs) it's easy to ensure that a wait is issued at shutdown, as @ogrisel pointed out (using try/finally, a context manager, atexit, ...).
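A context-manager variant of the same safeguard could look like this minimal sketch (the waiting_queue helper is hypothetical, not part of dpctl):

from contextlib import contextmanager

import dpctl

@contextmanager
def waiting_queue(queue):
    # Hypothetical helper: wait on all tasks submitted to `queue` when
    # leaving the block, even if an exception is raised.
    try:
        yield queue
    finally:
        queue.wait()

with waiting_queue(dpctl.SyclQueue()) as q:
    ...  # submit kernels to `q` asynchronously
# All tasks submitted to `q` have completed at this point.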

I have a doubt about the meaning of the text that follows:

The event-based design for asynchronous execution is needed due to SYCL queues being out-of-order by default. Ordering tasks in such queues requires the use of events. SYCL also support in-order queues, in which the task execution graph is linear and use of events is not needed. In order to support concurrent execution of SYCL kernels by capable hardware, it becomes necessary to use several independent in-order queues built for the same combination of device and context. Synchronizing tasks in different queues, both in-order and out-of-order can be accomplished using the sycl_ext_oneapi_enqueue_barrier extension exposed in dpctl as dpctl.SyclQueue.submit_barrier method.

If I understand correctly, taking this function as an example:

  • even though in this example there are two independent DAGs that are submitted to the queue, the kernels are in fact going to be executed sequentially, regardless of the device's support for concurrent execution, because there's only one queue. (It still might be interesting performance-wise, because the dispatching from the Python interpreter to the queue will run concurrently with the execution of the dispatched kernels, thus masking the dispatch overhead, but nothing more.)
  • to unlock concurrent execution, those two independent DAGs should be submitted using independent queues (see the sketch after this list)
  • and those queues should have the "in-order" property?
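If that reading is correct, the setup from the quoted paragraph would look roughly like the sketch below: two independent in-order queues for the same device, synchronized with submit_barrier (the in_order queue property and dpctl.SyclQueue.submit_barrier come from the dpctl documentation quoted above; the kernel submissions themselves are elided):

import dpctl

# Two independent in-order queues for the same device, so that two
# independent DAGs can execute concurrently if the hardware supports it.
q1 = dpctl.SyclQueue(property="in_order")
q2 = dpctl.SyclQueue(property="in_order")

# ... submit the first DAG's kernels to q1 and the second DAG's to q2 ...

# Synchronize the two queues: the barrier submitted to q1 depends on
# everything submitted to q2 so far.
e2 = q2.submit_barrier()
e1 = q1.submit_barrier([e2])
e1.wait()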

@ogrisel

ogrisel commented Nov 4, 2022

even though in this example there are two independent DAGs that are submitted to the queue, the kernels are in fact going to be executed sequentially, regardless of the device's support for concurrent execution, because there's only one queue.

Is there a way to confirm whether or not numba_dpex/dpctl use an in-order queue in this case? Is there no option to use an out-of-order queue with explicit dependencies on the tasks themselves?

@fcharras
Author

fcharras commented Nov 4, 2022

@ogrisel SYCL queues have an attribute that says whether they are in-order or not: https://intelpython.github.io/dpctl/latest/docfiles/dpctl/SyclQueue.html#dpctl.SyclQueue.is_in_order
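Concretely, checking the attribute looks like this (queues are out-of-order by default, per the paragraph quoted earlier):

import dpctl

print(dpctl.SyclQueue().is_in_order)                     # False by default
print(dpctl.SyclQueue(property="in_order").is_in_order)  # True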

@fcharras
Author

fcharras commented Nov 25, 2022

@diptorupd

Hearing about the refactoring in #816, I wondered if you also plan to work toward this? If I understand correctly, this feature would unlock the equivalent of CUDA streams in numba.cuda, and there are various issues that refer to it (e.g. #147 )

@diptorupd
Contributor

@fcharras @ogrisel @oleksandr-pavlyk

The bulk of the internal refactoring that was planned for the kernel API is now in master. PR #1049 is going to remove the support for NumPy arrays as kernel arguments.

Once #1049 is merged, we are free to support returning an event from the kernel submission call. There are certain design questions that should be addressed before that:

  1. How do we ensure that the arguments to the kernel are kept alive for the duration of kernel execution? Is it the programmer or the compiler who is responsible?
  2. What happens when the returned event object is not stored in a variable by the user? Should we move the wait call into the SyclEvent destructor?
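For question 2, moving the wait into the destructor would amount to something like this sketch (the _OwnedEvent wrapper is hypothetical, purely to illustrate the design question):

import dpctl

class _OwnedEvent:
    # Hypothetical wrapper: if the user never stores or waits on the
    # returned event, the destructor blocks until the kernel completes.

    def __init__(self, event):
        self._event = event  # a dpctl.SyclEvent

    def wait(self):
        self._event.wait()

    def __del__(self):
        self._event.wait()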

@fcharras
Author

Thank you for the work @diptorupd

Regarding the design questions, here are my thoughts. Those features are interesting for several reasons:

  • using a custom SyclQueue object rather than letting numba_dpex implicitly decide which SyclQueue a kernel is fed to. This is interesting because different queues can run kernels concurrently.

  • unlocking the parameters of SyclQueue.submit that are not currently exposed in numba_dpex; currently the only such argument is the dEvents argument, which enables building a computational dependency graph, then calling a single wait at the end once the graph is registered. This allows kernel execution to run asynchronously with respect to the Python interpreter, and it will hide the cost of whatever the Python interpreter is doing in the meantime.

  • and finally, being able to re-use a kernel that has been specialized, with the same inputs, repeatedly, without having to:

    • query the cache every time (the user expects to re-use the same kernel, so computing a hash and querying the cache should not be needed)
    • go through validation and unpacking of the inputs every time

The cost of those steps is low but not negligible, so it's good if it can be removed once a kernel has been specialized.

To avoid mistakes with the cache and input types, I think it's fair to limit the API to specialized kernels. Once a kernel has been specialized, there's no reason to probe the cache when calling the kernel object later on.

Couldn't it be something as simple as:

# sketch of a proposed API: `data_parallel_sum2` would have type `SyclKernel`
import dpctl
import dpctl.tensor as dpt
import numba_dpex as ndpx

# i64arrty / f32arrty: pre-built numba array types matching the inputs
@ndpx.kernel([(i64arrty, i64arrty, i64arrty), (f32arrty, f32arrty, f32arrty)], sycl_kernel=True)
def data_parallel_sum2(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]

a = dpt.ones(1024, dtype=dpt.int64)
b = dpt.ones(1024, dtype=dpt.int64)
c = dpt.zeros(1024, dtype=dpt.int64)

# needs a utility to manually run input validation once
args = data_parallel_sum2.validate_and_unpack(a, b, c)
my_sycl_queue = dpctl.SyclQueue()
my_event = my_sycl_queue.submit(data_parallel_sum2, args, gs=1024, ls=512, dEvent=None)
my_event.wait()

Besides that, I think that addressing 1. or 2., by keeping inputs alive as long as a task is not completed, or automatically calling wait on pending events in case of interpreter shutdown (features which probably require implementing callbacks on task completion), would be dpctl's job, and it's fine letting users deal with those issues too if dpctl doesn't provide that; it sounds manageable by users at the Python level.

For 1., maybe I fail to see what is at stake here? Worst case scenario, users could have kernels that read or write arrays that have been de-allocated, and get segfaults (on CPU) or nonsensical results (on GPU). This is already something numba_dpex users will face when writing dpex.kernel kernels anyway, since there's no automated bounds check on __getitem__ calls in kernels, and it seems that GPU kernels can read arbitrary addresses (which, if anything, is a bit worrying from a security point of view?).

For 2., I'm not quite sure what the risk is there either? I've never seen a device task not be cancelled when the interpreter is SIGKILL'd, and for proper termination (SIGTERM) it sounds fine to warn users (with documentation / examples) that they're responsible for calling wait properly (be it in context managers or atexit finalizers...), or else face undefined behavior (maybe segfaults).

@diptorupd
Contributor

@fcharras @ogrisel Asynchronous kernel submission is a feature we have recently added to numba_dpex.experimental.kernel. The user guide and examples still need to be updated, but you can look at the test case https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/tests/experimental/test_async_kernel.py for an example of how to use the feature.
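Based on the linked test case, usage looked roughly like the sketch below. This assumes an experimental call_kernel_async entry point taking the kernel, an index range, a tuple of dependent events, and the kernel arguments, and returning host/device events without blocking the interpreter; that signature is my reading of the test and may have changed since:

import dpctl
import dpctl.tensor as dpt
import numba_dpex as dpex
import numba_dpex.experimental as dpex_exp

@dpex_exp.kernel
def add(a, b, c):
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]

q = dpctl.SyclQueue()
a = dpt.ones(1024, dtype=dpt.int64, sycl_queue=q)
b = dpt.ones(1024, dtype=dpt.int64, sycl_queue=q)
c = dpt.zeros(1024, dtype=dpt.int64, sycl_queue=q)

# Assumed signature: (kernel, index range, dependent events, *kernel args).
host_ev, device_ev = dpex_exp.call_kernel_async(add, dpex.Range(1024), (), a, b, c)
device_ev.wait()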

We will migrate the feature to our core module once we have tested it out a bit more and a few other related changes are also ready to be migrated.

@diptorupd
Contributor

@ZzEeKkAa another one that should be marked as done post #1249

@ZzEeKkAa
Contributor

Implemented with #1249. Usage doc will be provided in #147

@ZzEeKkAa added this to the 0.22 milestone Dec 20, 2023