
On making task configuration and task args available to the user without executing #769

Closed
fcharras opened this issue Aug 29, 2022 · 11 comments

@fcharras

It seems the underlying dpctl API supports dispatching a SyclKernel asynchronously, and numba_dpex.kernel is a few lines of code away from exposing to the user tasks that can be fed to SyclQueues and mapped to SyclEvents that embed a DAG of execution dependencies.

This small test script shows how it works; the async_iter function here is five times faster than its sync counterpart.

Would you say that this is a correct take, and that it could be interesting to use? Would there be a case for exposing corresponding public methods?

@fcharras
Author

@oleksandr-pavlyk maybe you have insight on this?

@ogrisel

ogrisel commented Sep 8, 2022

As discussed IRL, this is related to the problem described in:

Based on my understanding, we need to make sure that queue.wait() is called before Python exits. We could achieve this with a try / finally in the user's code. Alternatively, we could use the Python atexit module to register a callback whenever we use the dpctl queue directly.
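For illustration, a minimal sketch of the atexit approach, assuming the dpctl queue is used directly (only dpctl.SyclQueue and its wait method, both mentioned in this thread, are used):

import atexit

import dpctl

queue = dpctl.SyclQueue()

# Safeguard: make sure every task submitted to `queue` has completed
# before the Python interpreter shuts down.
atexit.register(queue.wait)

# ... submit kernels to `queue` asynchronously; on normal interpreter
# exit, the registered callback calls queue.wait() ...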

@fcharras
Author

fcharras commented Sep 21, 2022

If I understand correctly the relevant paragraph is:

A user has to make sure that all tasks, including the host tasks, complete before exiting the script. To do that the user can explicitly wait on the returned events, or wait directly on the used execution queues.

If all it takes to avoid issues is ensuring a queue.wait(), is there really a limitation here? It doesn't sound unreasonable to expect users to wait() if it's well documented, and to expect chaotic behavior if they don't. And (apart from brutal terminations like segfaults and SIGTERMs) it's easy to ensure that a wait is issued at shutdown, as @ogrisel pointed out (using try/finally, a context manager, atexit, ...).
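A context-manager variant of the same safeguard could look like this minimal sketch (the waiting_queue helper is hypothetical, not part of dpctl):

from contextlib import contextmanager

import dpctl

@contextmanager
def waiting_queue(queue):
    # Hypothetical helper: wait on all tasks submitted to `queue` when
    # leaving the block, even if an exception is raised.
    try:
        yield queue
    finally:
        queue.wait()

with waiting_queue(dpctl.SyclQueue()) as q:
    ...  # submit kernels to `q` asynchronously
# All tasks submitted to `q` have completed at this point.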

I have a doubt about the meaning of the text that follows:

The event-based design for asynchronous execution is needed due to SYCL queues being out-of-order by default. Ordering tasks in such queues requires the use of events. SYCL also support in-order queues, in which the task execution graph is linear and use of events is not needed. In order to support concurrent execution of SYCL kernels by capable hardware, it becomes necessary to use several independent in-order queues built for the same combination of device and context. Synchronizing tasks in different queues, both in-order and out-of-order can be accomplished using the sycl_ext_oneapi_enqueue_barrier extension exposed in dpctl as dpctl.SyclQueue.submit_barrier method.

If I understand correctly, taking this function as an example:

  • even though in this example there are two independent DAGs that are submitted to the queue, the kernels are in fact going to be executed sequentially, regardless of the device's support for concurrent execution, because there's only one queue. (It still might be interesting performance-wise, because the dispatching from the Python interpreter to the queue will run concurrently with the execution of the dispatched kernels, thus masking the dispatch overhead, but nothing more.)
  • to unlock concurrent execution, those two independent DAGs should be submitted using independent queues (see the sketch after this list)
  • and those queues should have the "in-order" property?
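If that reading is correct, the setup from the quoted paragraph would look roughly like the sketch below: two independent in-order queues for the same device, synchronized with submit_barrier (the in_order queue property and dpctl.SyclQueue.submit_barrier come from the dpctl documentation quoted above; the kernel submissions themselves are elided):

import dpctl

# Two independent in-order queues for the same device, so that two
# independent DAGs can execute concurrently if the hardware supports it.
q1 = dpctl.SyclQueue(property="in_order")
q2 = dpctl.SyclQueue(property="in_order")

# ... submit the first DAG's kernels to q1 and the second DAG's to q2 ...

# Synchronize the two queues: the barrier submitted to q1 depends on
# everything submitted to q2 so far.
e2 = q2.submit_barrier()
e1 = q1.submit_barrier([e2])
e1.wait()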

@ogrisel

ogrisel commented Nov 4, 2022

even though in this example there are two independent DAGs that are submitted to the queue, the kernels are in fact going to be executed sequentially, regardless of the device's support for concurrent execution, because there's only one queue.

Is there a way to confirm whether or not numba_dpex/dpctl use an in-order queue in this case? Is there no option to use an out-of-order queue with explicit dependencies on the tasks themselves?

@fcharras
Author

fcharras commented Nov 4, 2022

@ogrisel SYCL queues have an attribute that says whether they are in-order or not: https://intelpython.github.io/dpctl/latest/docfiles/dpctl/SyclQueue.html#dpctl.SyclQueue.is_in_order
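Concretely, checking the attribute looks like this (queues are out-of-order by default, per the paragraph quoted earlier):

import dpctl

print(dpctl.SyclQueue().is_in_order)                     # False by default
print(dpctl.SyclQueue(property="in_order").is_in_order)  # True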

@fcharras
Author

fcharras commented Nov 25, 2022

@diptorupd

Hearing about the refactoring in #816, I wondered if you also plan to work toward this? If I understand correctly, this feature would unlock the equivalent of CUDA streams in numba.cuda, and there are various issues that refer to it (e.g. #147 )

@diptorupd
Contributor

@fcharras @ogrisel @oleksandr-pavlyk

The bulk of the internal refactoring that was planned for the kernel API is now in master. PR #1049 is going to remove the support for NumPy arrays as kernel arguments.

Once #1049 is merged, we are free to support returning an event from the kernel submission call. There are certain design questions that should be addressed before that:

  1. How do we ensure that the arguments to the kernel are kept alive for the duration of kernel execution? Is it the programmer or the compiler who is responsible?
  2. What happens when the returned event object is not stored in a variable by the user? Should we move the wait call into the SyclEvent destructor?
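For question 2, moving the wait into the destructor would amount to something like this sketch (the _OwnedEvent wrapper is hypothetical, purely to illustrate the design question):

import dpctl

class _OwnedEvent:
    # Hypothetical wrapper: if the user never stores or waits on the
    # returned event, the destructor blocks until the kernel completes.

    def __init__(self, event):
        self._event = event  # a dpctl.SyclEvent

    def wait(self):
        self._event.wait()

    def __del__(self):
        self._event.wait()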

@fcharras
Author

Thank you for the work @diptorupd

Regarding the design questions, here are my thoughts. Those features are interesting for several reasons:

  • using a custom SyclQueue object rather than letting numba_dpex implicitly decide which SyclQueue a kernel is fed to. This is interesting because different queues can run kernels concurrently.

  • unlocking the parameters of SyclQueue.submit that are not currently exposed in numba_dpex; currently the only such argument is the dEvents argument, which enables building a computational dependency graph, then calling a single wait at the end once the graph is registered. This allows kernel execution to run asynchronously with respect to the Python interpreter, and it will hide the cost of whatever the Python interpreter is doing in the meantime.

  • and finally, being able to re-use a kernel that has been specialized, with the same inputs, repeatedly, without having to:

    • query the cache every time (the user expects to re-use the same kernel, so computing a hash and querying the cache should not be needed)
    • go through validation and unpacking of the inputs every time

The cost of those steps is low but not negligible, so it's good if it can be removed once a kernel has been specialized.

To avoid mistakes with the cache and input types, I think it's fair to limit the API to specialized kernels. Once a kernel has been specialized, there's no reason to probe the cache when calling the kernel object later on.

Couldn't it be something as simple as:

# sketch of a proposed API: `data_parallel_sum2` would have type `SyclKernel`
import dpctl
import dpctl.tensor as dpt
import numba_dpex as ndpx

# i64arrty / f32arrty: pre-built numba array types matching the inputs
@ndpx.kernel([(i64arrty, i64arrty, i64arrty), (f32arrty, f32arrty, f32arrty)], sycl_kernel=True)
def data_parallel_sum2(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]

a = dpt.ones(1024, dtype=dpt.int64)
b = dpt.ones(1024, dtype=dpt.int64)
c = dpt.zeros(1024, dtype=dpt.int64)

# needs a utility to manually run input validation once
args = data_parallel_sum2.validate_and_unpack(a, b, c)
my_sycl_queue = dpctl.SyclQueue()
my_event = my_sycl_queue.submit(data_parallel_sum2, args, gs=1024, ls=512, dEvent=None)
my_event.wait()

Besides that, I think that addressing 1. or 2., by keeping inputs alive as long as a task is not completed, or automatically calling wait on pending events in case of interpreter shutdown (features which probably require implementing callbacks on task completion), would be dpctl's job, and it's fine letting users deal with those issues too if dpctl doesn't provide that; it sounds manageable by users at the Python level.

For 1., maybe I fail to see what is at stake here? Worst case scenario, users could have kernels that read or write arrays that have been de-allocated, and get segfaults (on CPU) or nonsensical results (on GPU). This is already something numba_dpex users will face when writing dpex.kernel kernels anyway, since there's no automated bounds check on __getitem__ calls in kernels, and it seems that GPU kernels can read arbitrary addresses (which, if anything, is a bit worrying from a security point of view?).

For 2., I'm not quite sure what the risk is there either? I've never seen a device task not be cancelled when the interpreter is SIGKILL'd, and for proper termination (SIGTERM) it sounds fine to warn users (with documentation / examples) that they're responsible for calling wait properly (be it in context managers or atexit finalizers...), or else face undefined behavior (maybe segfaults).

@diptorupd
Contributor

@fcharras @ogrisel Asynchronous kernel submission is a feature we have recently added to numba_dpex.experimental.kernel. The user guide and examples still need to be updated, but you can look at the test case https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/tests/experimental/test_async_kernel.py for an example of how to use the feature.
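Based on the linked test case, usage looked roughly like the sketch below. This assumes an experimental call_kernel_async entry point taking the kernel, an index range, a tuple of dependent events, and the kernel arguments, and returning host/device events without blocking the interpreter; that signature is my reading of the test and may have changed since:

import dpctl
import dpctl.tensor as dpt
import numba_dpex as dpex
import numba_dpex.experimental as dpex_exp

@dpex_exp.kernel
def add(a, b, c):
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]

q = dpctl.SyclQueue()
a = dpt.ones(1024, dtype=dpt.int64, sycl_queue=q)
b = dpt.ones(1024, dtype=dpt.int64, sycl_queue=q)
c = dpt.zeros(1024, dtype=dpt.int64, sycl_queue=q)

# Assumed signature: (kernel, index range, dependent events, *kernel args).
host_ev, device_ev = dpex_exp.call_kernel_async(add, dpex.Range(1024), (), a, b, c)
device_ev.wait()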

We will migrate the feature to our core module once we have tested it out a bit more and a few other related changes are also ready to be migrated.

@diptorupd
Contributor

@ZzEeKkAa another one that should be marked as done post #1249

@ZzEeKkAa
Contributor

Implemented with #1249. Usage doc will be provided in #147

@ZzEeKkAa added this to the 0.22 milestone Dec 20, 2023