Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Stream ReadAtLeast and ReadExactly #69272

Merged
merged 16 commits into from
May 20, 2022
Merged

Conversation

eerhardt
Copy link
Member

Adds methods to Stream to read at least a minimum amount of bytes, or a full buffer, of data from the stream.
ReadAtLeast allows for the caller to specify whether an exception should be thrown or not on the end of the stream.

Make use of the new methods where appropriate in net7.0.

Fix #16598

Adds methods to Stream to read at least a minimum amount of bytes, or a full buffer, of data from the stream.
ReadAtLeast allows for the caller to specify whether an exception should be thrown or not on the end of the stream.

Make use of the new methods where appropriate in net7.0.

Fix dotnet#16598
@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost ghost assigned eerhardt May 12, 2022
@ghost
Copy link

ghost commented May 12, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Adds methods to Stream to read at least a minimum amount of bytes, or a full buffer, of data from the stream.
ReadAtLeast allows for the caller to specify whether an exception should be thrown or not on the end of the stream.

Make use of the new methods where appropriate in net7.0.

Fix #16598

Author: eerhardt
Assignees: -
Labels:

area-System.IO, new-api-needs-documentation

Milestone: -

int bytesRead;
int offset = 0;
while (offset < buffer.Length)
int bytesRead = await TAdapter.ReadAtLeastAsync(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a risk here. Previously if the async read pends, it would force the ReadMessageAsync state machine to be allocated. Now if the async read pends, it will force the ReadMessageAsync state machine to be allocated and the ReadAtLeastAsync state machine to be allocated.

We should consider using the pooled async state machine builders on the ReadAtLeastAsync async methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this as simple as adding

[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]

to the Stream.ReadAtLeastAsyncCore method? Or is it more involved than that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation-wise, that's it, easy peasy. The important thing is validating performance. The concern with such pooling is that it can have a negative impact on the larger system, especially if it ends up creating more gen2 to gen0 references (by referencing gen0 objects from pooled objects that are likely to eventually be gen2). But I worry that without doing this, the methods will end up not being used because folks will see the allocation and choose to avoid the helpers as a result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's chat tomorrow about how I can validate the performance here. I'd like to get your thoughts on what to try / look for.

Copy link
Member

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this change! Left some comments and questions.

@eerhardt eerhardt marked this pull request as ready for review May 13, 2022 22:18
@eerhardt
Copy link
Member Author

Thanks for the review so far. I now believe this is fully ready for review. PTAL.

@stephentoub
Copy link
Member

stephentoub commented May 17, 2022

My personal preference would be to throw for empty buffers

More specifically, I think we should throw for minimumBytes <= 0.

There are two main situations where someone might be passing 0 to a read method:

  • Want the "wait for data to be available" behavior
  • Doing math to compute how much to read and not special-casing 0

Read is available for the former. For the latter, a direct call to ReadAtLeast with a specified lower bound of 0 doesn't seem likely, given that this is expected to be used in scenarios like protocol parsing where you're expecting some number of bytes necessary to make forward progress, and for ReadExactly, it'll typically be used in situations where you're trying to read everything in the payload into some buffer you have... doing math is much more common with Read, where you might need to loop, and we wouldn't expect much looping around ReadExactly.

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on responses provided by @stephentoub it LGTM. Thanks!

eerhardt added 2 commits May 17, 2022 15:43
ReadAtLeast throws when minimumBytes is not positive.
ReadExactly throws when an empty buffer is passed, or count is not positive.
Copy link
Member

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some documentation suggestions.

@GSPP
Copy link

GSPP commented May 18, 2022

How would ReadExactly be used to read a length-prefixed binary value from some protocol? A zero-length binary data is no special case and should just work.

My understanding is that this call would currently throw. I'm not sure why there is a need to create a special case where, naturally, no special case is necessary.

@stephentoub mentioned that this can happen if "buffer math" results in a zero-length read. I don't see this as a bad thing. It can be an outcome of natural code patterns.

@stephentoub
Copy link
Member

stephentoub commented May 18, 2022

I'm not sure why there is a need to create a special case

Because otherwise it could hang until there's data available. If you want that behavior, Read already exists and provides everything needed. In your example scenario, it's almost certainly not the behavior you want.

An alternative is special-casing the new methods to handle 0 by always returning immediately rather than throwing.

@GSPP
Copy link

GSPP commented May 18, 2022

Oh, now I see. If we have a zero-length read, this would cause many streams to wait until more data is incoming! That's a bug, and I'm afraid I will now have to check some of my code for this particular mistake 😁. Thanks for explaining.

But that's actually an argument to add that special case check. That seems to fit the use-case perfectly.

Without that special case, many callers will need to special case this themselves. I have done quite a bit of protocol work, and zero-length data is not uncommon.

@stephentoub
Copy link
Member

An alternative is special-casing the new methods to handle 0 by always returning immediately

@eerhardt, what would you think of this?

@eerhardt
Copy link
Member Author

eerhardt commented May 18, 2022

I don't really like having different behavior between Read(buffer, 0, 0) and ReadAtLeast(buffer, 0). My first proposal was to make these behave roughly equivalent - block until data is available. Passing in an empty buffer into both ReadAtLeast and ReadExactly became a problem though, since 0 was returned from the Read operation, and that looked like an end-of-stream.

At a high-level, I can think of the following possibilities:

  1. Check up front and throw an exception. (current PR behavior)
  2. Return immediately, don't do a "read", even though a "read" method was called.
  3. Always call Read, with 2 options on when an empty buffer is passed:
    a. Throw an exception
    b. Special case an empty buffer, and allow for 0 to be returned from Read. Return successfully for ReadExactly and 0 for ReadAtLeast.

My current thinking is that 3(b) seems to be the most consistent with Stream.Read, and would be reasonable behavior.

@stephentoub
Copy link
Member

stephentoub commented May 18, 2022

My current thinking is that 3(b) seems to be the most consistent with Stream.Read, and would be reasonable behavior.

I don't think that's desirable. It means a scenario like:

int numBytesInPayload = ReadPayloadSize(stream);
byte[] payload = new byte[numBytesInPayload];
stream.ReadExactly(payload);

could block indefinitely even when the payload has been fully and successfully read.

I do think 0 is special for ReadExactly and ReadAtLeast and needs to be treated differently from what Read does. We could choose to throw (1) or return immediately (2), but calling Read with 0 (3) is problematic.

If someone were using their own read loop for this, they would be special-casing 0, e.g.

int numBytesInPayload = ReadPayloadSize(stream);
byte[] payload = new byte[numBytesInPayload];
int totalRead = 0;
while (totalRead < numBytesInPayload)
{
    int n = stream.Read(payload.AsSpan(totalRead));
    if (n == 0) throw ...;
    totalRead += n;
}

such that stream.Read would never be called if numBytesInPayload were 0. We want them to be able to switch to do that instead with ReadExactly, but with any variant of (3), they'd have to special-case 0 themselves:

if (payload != 0)
{
    stream.ReadExactly(payload);
}

Now, maybe we'd want them to do that anyway in this particular example, in order to prefer using Array.Empty<byte>() in that case. But that's an optimization, and this API is meant to be the easy button.

Is there a use case where someone would be using ReadExactly or ReadAtLeast, passing in 0, and wanting the "wait until data is available" behavior that stream.Read might provide?

@eerhardt
Copy link
Member Author

Are there two different scenarios here?

  1. ReadExactly and ReadAtLeast with an empty buffer (which implies minimumBytes must be 0).
  2. ReadAtLeast with a non-empty buffer, but passing in minimumBytes = 0.

In case the non-empty buffer case, would you expect no Read to happen?

@stephentoub
Copy link
Member

stephentoub commented May 18, 2022

Are there two different scenarios here?

That's part of my question... what is the scenario for calling ReadAtLeast with a non-zero-length buffer and minimumBytes == 0?
Why wouldn't you just use Read? I'm not clear on in what situation you'd say "please read data but if you feel like not giving any back that's cool".

@eerhardt
Copy link
Member Author

eerhardt commented May 18, 2022

Maybe the behavior split makes sense between the new methods then?

  • ReadExactly with an empty buffer (or count == 0 for the array overload) - no-ops
  • ReadAtLeast with minimumBytes == 0 - throws

I just wouldn't expect ReadAtLeast to no-op. It feels closer to Read, and isn't the easy button. If we don't have a scenario for ReadAtLeast(buffer, 0), maybe it makes the most sense to throw.

@stephentoub
Copy link
Member

stephentoub commented May 18, 2022

Actually, I just answered my own question ("I'm not clear on in what situation you'd say "please read data but if you feel like not giving any back that's cool"). Consider the same scenario I outlined earlier, but where the protocol parser is maintaining its own buffer. It wants to read the payload but also any additional data that might be available, in order to avoid unnecessary additional calls for when it wants more data later. That's what ReadAtLeast is for, and you wouldn't want it to throw for 0 (0 is valid) nor block waiting for more (because the 0 was satisfied). Allowing 0 to just nop means ReadAtLeast works for non-0-length in this manner, giving back as much as it can, while still enabling 0 to work without throwing or blocking.

Stream.Read special-cases 0 (on some streams). ReadExactly and ReadAtLeast serve different purposes than does Read, and since they're implemented in terms of it (effectively as an implementation detail), they then also need to special-case 0. I think it makes sense for them to special-case it to mean "I've read what you asked and now I'm done". It would be nice if ReadAtLeast could give more back if more was available, but to do so it would need to call Read with 0, which could block.

@stephentoub
Copy link
Member

I just wouldn't expect ReadAtLeast to no-op

Why? It's in the name "at least"... when it has at least 0, it's done.

@eerhardt
Copy link
Member Author

OK, I can make this change.

It would be nice if ReadAtLeast could give more back if more was available, but to do so it would need to call Read with 0, which could block.

This might be a case for making ReadAtLeast virtual - so a derived stream could give back data if it was already available, without blocking. But I don't see it as a "need", and we can add it in the future if necessary.

…Bytes == 0.

Both of these cases are a no-op. No exception is thrown. A read won't be issued to the underlying stream. They won't block until data is available.
@eerhardt eerhardt merged commit de27aa1 into dotnet:main May 20, 2022
@eerhardt eerhardt deleted the StreamReadAtLeast branch May 20, 2022 14:20
eerhardt added a commit to eerhardt/runtime that referenced this pull request May 26, 2022
dotnet#69272 changed DCS to no longer call Stream.Read inside a loop, but instead call the new ReadAtLeast method. ReadAtLeast only takes a Span<byte>, and not a byte[]. This caused a regression because the internal encoding stream wrapper classes don't override Read(Span<byte>), so the base implementation is used. The base implementation is slower because it needs to rent a byte[] from the pool, and do 2 copies.

Overriding Read(Span<byte>) on the internal encoding stream wrapper classes allows ReadAtLeast to be just as fast.

Fix dotnet#69730
eerhardt added a commit that referenced this pull request May 28, 2022
)

* Fix Stream.ReadAtLeast perf regression in DataContractSerializer

#69272 changed DCS to no longer call Stream.Read inside a loop, but instead call the new ReadAtLeast method. ReadAtLeast only takes a Span<byte>, and not a byte[]. This caused a regression because the internal encoding stream wrapper classes don't override Read(Span<byte>), so the base implementation is used. The base implementation is slower because it needs to rent a byte[] from the pool, and do 2 copies.

Overriding Read(Span<byte>) on the internal encoding stream wrapper classes allows ReadAtLeast to be just as fast.

Fix #69730

* Add Span overloads for Stream Read and Write implementations that don't currently override the Span overloads.
@ghost ghost locked as resolved and limited conversation to collaborators Jun 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide a version of Stream.Read that reads as much as possible
8 participants