ZeroCopy "buffer released" notification/future #234

redbaron · 2023-02-09T08:29:18Z

When using the zero-copy operations SEND_ZC and SENDMSG_ZC, two CQEs are returned: one for the operation itself, and a second one at some later point, when the buffer is released and can be reused.

Here is a good description of it from an LWN article:

A zero-copy networking implementation must have a way to inform applications when any given operation is truly complete; the application cannot reuse a buffer containing data to be transmitted if the kernel is still working on it. There is a subtle point that is relevant here: the completion of a send() call (for example) does not imply that the associated buffer is no longer in use. The operation "completes" when the data has been accepted into the networking subsystem for transmission; the higher layers may well be done with it, but the buffer itself may still be sitting in a network interface's transmission queue. A zero-copy operation is only truly done with its data buffers when the hardware has done its work — and, for many protocols, when the remote peer has acknowledged receipt of the data. That can happen long after the operation that initiated the transfer has completed.

Currently tokio-uring merges the 2 notifications into a single future, with the negative consequence that the user-visible future completes much later for the ZC version than for the non-ZC one. This happens because the non-ZC future completes when the first CQE arrives, that is, when the data has been written to the TCP buffers, for instance. For a ZC call, however, the second CQE (and therefore the user-visible future) completes only when all layers in the kernel are done with the buffer, which in the TCP case is when the data has been ACKed by the remote.

IMHO it makes sense for send[msg]_zc to return 2 futures: one for the first "operation-completed" CQE, matching the non-ZC counterpart, and a second "buffer-is-free" future for the second CQE, indicating when the buffer can be released.

The user can then use the second future to track the lifecycle of the buffer, or better, tokio-uring could provide a buffer tracker helper which owns buffers and their corresponding "buffer-is-free" futures and frees the buffers on future completion. It is the responsibility of the user to periodically poll the buffer tracker for it to do its work. For fixed buffers, the fixed buffer registry can become such a tracker.

/cc @ollie-etl as the original author of zerocopy support.
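A minimal sketch of what such a buffer tracker could look like, using only the standard library. `BufferTracker`, `register`, `poll_once` and the `BufIsFree` alias are hypothetical names chosen for illustration, not tokio-uring API, and the "buffer-is-free" future is assumed to resolve to the released buffer. The tracker polls with a no-op waker, so it only makes progress when the user calls `poll_once` periodically, as described above.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical "buffer-is-free" future: resolves to the buffer once the
/// kernel no longer references it.
type BufIsFree = Pin<Box<dyn Future<Output = Vec<u8>>>>;

/// Waker that does nothing; the tracker relies on being polled periodically.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical tracker that owns pending notifications and reclaimed buffers.
#[derive(Default)]
struct BufferTracker {
    pending: Vec<BufIsFree>,
    free: Vec<Vec<u8>>,
}

impl BufferTracker {
    /// Hands a "buffer-is-free" future over to the tracker.
    fn register(&mut self, fut: BufIsFree) {
        self.pending.push(fut);
    }

    /// Polls every pending future once and moves released buffers to `free`.
    /// Returns how many buffers were reclaimed by this call.
    fn poll_once(&mut self) -> usize {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        let mut reclaimed = 0;
        let mut i = 0;
        while i < self.pending.len() {
            let polled = self.pending[i].as_mut().poll(&mut cx);
            match polled {
                Poll::Ready(buf) => {
                    self.free.push(buf);
                    // Drop the finished future; do not advance `i`, because
                    // swap_remove moved another future into this slot.
                    drop(self.pending.swap_remove(i));
                    reclaimed += 1;
                }
                Poll::Pending => i += 1,
            }
        }
        reclaimed
    }
}
```

In a main loop, the user would call `poll_once` on each iteration and take reusable buffers out of `free`. A production version would more likely integrate with the runtime's wakers than poll blindly.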

FrankReh · 2023-02-09T11:44:53Z

Maintainer

Why do you want the first notification presented to the user? And the uring API doesn't promise there will be one or that there will be just one.

1 reply

redbaron Feb 9, 2023
Author

> Why do you want the first notification presented to the user?

For the same reason we do for non-ZC operations: to tell the user that their request is "done" (what that actually means depends heavily on the operation and protocol).

> uring API doesn't promise there will be one

My understanding is that every SQE is matched by at least one CQE.

> that there will be just one.

For any uring operation + flag there is a guarantee of how many CQEs will be received. SEND_ZC (EDIT: + IORING_SEND_ZC_REPORT_USAGE), for instance, will get 2.

FrankReh · 2023-02-09T15:44:49Z

Maintainer

It's my understanding their request is "done" when the final CQE is received.

Can you give some example or color to how a user would interpret an intermediate CQE (one with a more bit set) if we made that available? I don't use those commands myself. But I know in one case at least, there is a bytes value that is used to increment a total so when the final CQE is received, the total bytes written can be reported to the user.

redbaron · 2023-02-09T17:56:04Z

Author

> It's my understanding their request is "done" when the final CQE is received.

Then your definition of "done" differs depending on whether you use send or send_zc, which in my view is not desirable. Regardless of ZC or not, I'd expect to be notified at the same point of the request lifecycle. It is just that with ZC we need an additional notification to inform us about buffer status, but the "main" (first) notification should be issued at the same point for send and send_zc.

Let's have a look at a very slow but fresh (as in nothing was sent yet) TCP connection and

let (result, _) = stream.write(b"hello world!".as_slice()).await;

When do we get the result? When the buffer is written to the in-kernel TCP socket buffers, which is immediately in our case, because there is room for our buffer.

What will it look like with write_zc (I know tokio-uring doesn't support it, but my intention is to add such support)?

let (result, buf) = stream.write_zc(b"hello world!".as_slice()).await;

When do we get the result in this case? Currently at a very different time. We get the result when the second CQE is received, the one which notifies us that the buffer is not in use anymore. For our TCP connection that will be when the remote peer ACKs our data. This difference is what I am asking to eliminate. ZC should be just an optimization, not a change of semantics.

What might it look like? I propose that ZC methods return 2 futures: one for the "main" CQE and a second for the buffer-free notification. The current behaviour would then look like the following:

let (main_future, buf_is_free) = stream.write_zc(b"hello world!".as_slice());
let result = main_future.await;
let buf = buf_is_free.await;

This is clearly suboptimal, because the second future can take an arbitrarily long time and we want to make progress with our app. A better option is to hand it to some buffer manager, make sure to poll that manager in the main loop, and let it free buffers.

let (main_future, buf_is_free) = stream.write_zc(b"hello world!".as_slice());
let (result, _) = main_future.await;
buffer_manager.register(buf_is_free);    // make sure to poll buffer_manager in the main loop
// continue program without waiting for buffer to be released
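To make the two-future idea concrete, here is a rough standard-library sketch of how a driver might route the two CQEs of one zero-copy send to two separate futures. Everything here is an assumption for illustration: `Shared`, `MainFuture`, `BufIsFree`, `send_zc`, `on_cqe` and `poll_now` are hypothetical names, no real SQE is submitted, and tokio-uring's internals are not modelled. Only the two flag values are taken from the kernel header.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// CQE flag values from the kernel's <linux/io_uring.h>.
const IORING_CQE_F_MORE: u32 = 1 << 1;
const IORING_CQE_F_NOTIF: u32 = 1 << 3;

/// State shared by the (hypothetical) driver and the two futures.
struct Shared {
    buf: Option<Vec<u8>>,
    res: Option<i32>,
    released: bool,
    main_waker: Option<Waker>,
    free_waker: Option<Waker>,
}
type Op = Rc<RefCell<Shared>>;

/// Completes on the first ("result") CQE with its `res` value.
struct MainFuture(Op);
/// Completes when the kernel is done with the buffer and hands it back.
struct BufIsFree(Op);

impl Future for MainFuture {
    type Output = i32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        let mut s = self.0.borrow_mut();
        match s.res.take() {
            Some(res) => Poll::Ready(res),
            None => {
                s.main_waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

impl Future for BufIsFree {
    type Output = Vec<u8>;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Vec<u8>> {
        let mut s = self.0.borrow_mut();
        if s.released {
            Poll::Ready(s.buf.take().expect("BufIsFree polled after completion"))
        } else {
            s.free_waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

/// Hypothetical submission; a real driver would also push an SQE here.
fn send_zc(buf: Vec<u8>) -> (Op, MainFuture, BufIsFree) {
    let op = Rc::new(RefCell::new(Shared {
        buf: Some(buf),
        res: None,
        released: false,
        main_waker: None,
        free_waker: None,
    }));
    (op.clone(), MainFuture(op.clone()), BufIsFree(op))
}

/// Routes one CQE to the future it belongs to and wakes that future.
fn on_cqe(op: &Op, res: i32, flags: u32) {
    let mut wakers = Vec::new();
    {
        let mut s = op.borrow_mut();
        if (flags & IORING_CQE_F_NOTIF) != 0 {
            s.released = true;
            wakers.extend(s.free_waker.take());
        } else {
            s.res = Some(res);
            wakers.extend(s.main_waker.take());
            if (flags & IORING_CQE_F_MORE) == 0 {
                // No notification will follow, so the buffer is already free.
                s.released = true;
                wakers.extend(s.free_waker.take());
            }
        }
    }
    for w in wakers {
        w.wake();
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future once with a no-op waker (for demonstration only).
fn poll_now<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}
```

With this shape, the main future becomes ready as soon as `on_cqe` sees the result CQE, while the buffer future stays pending until the CQE carrying IORING_CQE_F_NOTIF arrives. If the result CQE lacks IORING_CQE_F_MORE, no notification is expected and the buffer is treated as free right away.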

3 replies

FrankReh Feb 9, 2023
Maintainer

And you would be sure there is just one more CQE coming for this file type? Is that true of all file types? The io_uring documentation didn't spell out one more followed by one final, I think it just said there could be multiple more cqe.

I understand you have a different definition of done. Most of us working on this crate use the definition of the buffer being returned. That's when the operation is done. You want to know stuff beforehand, and I can see your point of wanting to issue another write, even if it means using a different buffer; you want to be able to keep the channel well utilized.

I think you're going to get easier buy-in from multiple people if you propose a solution that is more flexible than just two futures. Maybe a callback with which you can trigger any other event you like, or maybe a builder that gives you fine grained control of parameters, like a channel or a stream.
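For comparison, a rough sketch of the channel flavour of that idea, where each CQE becomes one event on a channel and the number of events is not fixed by the API. `SendZcEvent`, `simulate_send_zc` and `drain` are hypothetical names, the driver is simulated, and std's `mpsc` channel stands in for whatever channel or stream type the crate would actually use.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical event type: one variant per kind of CQE.
#[derive(Debug, PartialEq)]
enum SendZcEvent {
    /// Result CQE: bytes sent, or a negative errno value.
    Completed(Result<usize, i32>),
    /// Notification CQE: the buffer is handed back to the user.
    BufferReleased(Vec<u8>),
}

/// Stand-in for the driver: pretends the whole buffer was sent and then
/// released, reporting both steps as separate events.
fn simulate_send_zc(buf: Vec<u8>, events: &Sender<SendZcEvent>) {
    let len = buf.len();
    events
        .send(SendZcEvent::Completed(Ok(len)))
        .expect("receiver dropped");
    events
        .send(SendZcEvent::BufferReleased(buf))
        .expect("receiver dropped");
}

/// Consumes every event currently queued; returns the total bytes sent and
/// the buffers that can be reused.
fn drain(events: &Receiver<SendZcEvent>) -> (usize, Vec<Vec<u8>>) {
    let mut bytes_sent = 0;
    let mut reusable = Vec::new();
    for event in events.try_iter() {
        match event {
            SendZcEvent::Completed(Ok(n)) => bytes_sent += n,
            SendZcEvent::Completed(Err(_errno)) => {} // error handling elided
            SendZcEvent::BufferReleased(buf) => reusable.push(buf),
        }
    }
    (bytes_sent, reusable)
}
```

The trade-off against two futures is that the caller has to match on event kinds at runtime, but the interface keeps working if an operation ever produces more than two CQEs.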

redbaron Feb 9, 2023
Author

> And you would be sure there is just one more CQE coming for this file type? Is that true of all file types? The io_uring documentation didn't spell out one more followed by one final, I think it just said there could be multiple more cqe.

I am not the author of the uring patches and haven't used it in anger, but my understanding is that for any given SQE the user can expect a predictable number of CQEs, even when an error occurs. SQE flags affecting the number of CQEs are clearly documented, and it is unreasonable to think that the number of CQEs can change without the user opting in with a flag in the SQE, even when future kernel versions introduce some changes. An unknown number of CQEs is currently possible with multishot requests, but these are on the receiving side and don't use ZC writes.

> Most of us working on this crate use the definition of the buffer being returned. That's when the operation is done.

What is "done" in the case of multishot, which gives you buffers instead of you providing them? There is a common pattern for all uring operations: every SQE creates one or more (in the case of multishot) "regular" CQEs with the result of the operation you are trying to perform. This flow is at the core of uring: what used to be the return value + errno of a syscall becomes cqe->res. In addition to regular CQEs, the user can request auxiliary CQEs, currently only of type IORING_CQE_F_NOTIF (no doubt there will be more), which are sent when the user opts in to these notifications. What is important is that auxiliary CQEs must be distinguishable from regular "result" CQEs, otherwise no sane program can be written.

So we can count on the number of CQEs and their "type" for any given SQE. Given that, I think we can commit to returning a fixed number of futures for SQEs we control (as in, we don't allow flags to be passed by users, or we sanitize those flags).

> You want to know stuff beforehand and I can see your point of wanting to issue another write, even if it means using a different buffer, you want to be able to keep the channel well utilized.

The man page summarizes well why different CQEs are used.

The flags field of the first struct io_uring_cqe may likely contain IORING_CQE_F_MORE, which means that there will be a second completion event / notification for the request, with the user_data field set to the same value. The user must not modify the data buffer until the notification is posted. The first cqe follows the usual rules and so its res field will contain the number of bytes sent or a negative error code. The notification's res field will be set to zero and the flags field will contain IORING_CQE_F_NOTIF. The two step model is needed because the kernel may hold on to buffers for a long time, e.g. waiting for a TCP ACK, and having a separate cqe for request completions allows userspace to push more data without extra delays. Note, notifications are only responsible for controlling the lifetime of the buffers, and as such don't mean anything about whether the data has actually been sent out or received by the other end. Even errored requests may generate a notification, and the user must check for IORING_CQE_F_MORE rather than relying on the result.

They could have kept using the same CQE and just issued it later, but decided against it. If speed is the goal of using ZC, then pumping the socket is the only way forward.
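The rules quoted from the man page can be sketched as a small classifier over the `res` and `flags` fields of a CQE. The two flag values come from the kernel's `<linux/io_uring.h>`; the `ZcEvent` type and the `classify` function are hypothetical names used only for illustration.

```rust
// CQE flag values from the kernel's <linux/io_uring.h>.
const IORING_CQE_F_MORE: u32 = 1 << 1;
const IORING_CQE_F_NOTIF: u32 = 1 << 3;

/// What a single CQE of a zero-copy send tells the application
/// (hypothetical type, not part of tokio-uring).
#[derive(Debug, PartialEq)]
enum ZcEvent {
    /// First CQE: `res` is the number of bytes sent or a negative error code.
    /// `notif_pending` says whether a notification CQE will still follow.
    Result { res: i32, notif_pending: bool },
    /// Notification CQE: the kernel no longer uses the data buffer.
    BufferReleased,
}

/// Classifies one CQE of a SEND_ZC / SENDMSG_ZC request.
fn classify(res: i32, flags: u32) -> ZcEvent {
    if (flags & IORING_CQE_F_NOTIF) != 0 {
        ZcEvent::BufferReleased
    } else {
        ZcEvent::Result {
            res,
            // Check the flag, not the result: even a failed request may
            // be followed by a notification.
            notif_pending: (flags & IORING_CQE_F_MORE) != 0,
        }
    }
}
```

The buffer must stay untouched for as long as a `Result` with `notif_pending == true` has not been followed by a `BufferReleased` for the same `user_data`.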

> if you propose a solution that is more flexible than just two futures.

As stated above, I think we can guarantee the number of futures for any given send_zc operation. A future is the most ergonomic way I can think of to handle it.

> Maybe a callback with which you can trigger any other event you like

AFAIK there is no way to have useful callbacks (== with closures) without boxing them.

FrankReh Feb 9, 2023
Maintainer

We'll see where others want to take this discussion. I agree that the io_uring developers probably had a reason for returning the more cqes and so far we are just ignoring that reason.

Also know our plans are to support a streaming interface for the multishot operations. Maybe the final cqe would become the final streamed item or maybe it would be its own return after the stream, that is likely still debatable.
