Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize BufWriter #79930

Merged
merged 7 commits into from
May 6, 2021
Merged

Conversation

tgnottingham
Copy link
Contributor

No description provided.

@rust-highfive
Copy link
Collaborator

r? @m-ou-se

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 11, 2020
@tgnottingham
Copy link
Contributor Author

tgnottingham commented Dec 11, 2020

This improves BufWriter performance by inlining the hot paths of write and write_all, not calling expensive Vec methods, and optimizing for the case where the input size is less than the buffer size.

Optimization was guided by rustc benchmarks. I have a local change that uses BufWriter more heavily in rustc. It changes code that wrote to a Vec, then wrote that Vec to a File, to code that writes directly to a BufWriter<File>.

Performance was initially completely unworkable.

Result of switching from baseline to current BufWriter

bufwriter_unoptimized

Here is the effect of adding each optimization to the current Bufwriter.

Optimization 1: inline hot paths
Optimization 2: avoid expensive Vec methods
Optimization 3: optimize for input size < buffer size

Result of Optimization 1 versus current BufWriter

bufwriter_opt1

Result of Optimization 1 + 2 versus Optimization 1

bufwriter_opt2

Result of Optimization 1 + 2 + 3 versus Optimization 1 + 2

bufwriter_opt3

Realize that these are benchmarks of the entirety of compilation, not just BufWriter code. So an increase in performance in the benchmarks represents a much larger improvement to BufWriter performance, at least according to rustc's use of it.

Obviously optimization 2 is the most impactful, optimization 1 is good, and optimization 3 would be good any other day of the week, but after the other two, it's like, whatever. :)

After all was said and done, the toll of using BufWriter heavily in rustc was far more tolerable:

Result of switching from baseline to optimized BufWriter

bufwriter_optimized

Still isn't quite where I want it to be though. See #79921 for details.

@tgnottingham
Copy link
Contributor Author

tgnottingham commented Dec 11, 2020

I'm aware that #78551 is also making changes in BufWriter. I'm hoping that this PR and @Lucretiel's can be reconciled. I'm pretty certain the change to avoid using Vec methods can be, but I'm not as sure about the inlining and common case optimization parts.

@the8472
Copy link
Member

the8472 commented Dec 14, 2020

Out of curiosity, have you experimented with a smaller hammer by using #[cold] or annotating the branches with likely/unlikely instead of #[inline(never)]?

@Kogia-sima
Copy link
Contributor

I have a concern that this optimization is unsound when buf.len() >= isize::MAX. It may cause overflow, go into the fast path, and apply copy operation excessing the buffer capacity.

@tgnottingham
Copy link
Contributor Author

Out of curiosity, have you experimented with a smaller hammer by using #[cold] or annotating the branches with likely/unlikely instead of #[inline(never)]?

I don't think I did--definitely not #[cold] anyway. I'll play with them.

Now that you mention it, I'm not quite sure what the right thing to do is here. If I don't separate the functions out, the rustc benchmarks take a pretty big hit. So I know that, for my modified rustc's use of BufWriter, it's better if they don't get inlined.

But maybe #[inline(never)] is too extreme? Maybe there are some situations where the optimizer can see that it would be better to inline them.

I could just remove the annotations, as currently, the "cold" functions don't get inlined in rustc, even without #[inline(never)]. But if they start getting inlined again as a result of some other change (and the lack of annotation), perf could take a hit for workloads where the inlining is harmful.

Anyway, I'm open to suggestions.

@tgnottingham
Copy link
Contributor Author

tgnottingham commented Dec 15, 2020

I have a concern that this optimization is unsound when buf.len() >= isize::MAX. It may cause overflow, go into the fast path, and apply copy operation excessing the buffer capacity.

Ah, thank you. Looks like there was a correctness issue already, if I understand correctly, but it didn't lead to unsoundness until my change. I'll fix the problem.

Update: after looking into this, there's mostly no concern about overflow, as Vecs, slices, and objects in general are limited to isize::MAX bytes. Although there's no official spec for the language mandating this, I think it's de facto official, based on things like this:

The isize type is a signed integer type with the same number of bits as the platform's pointer type. The theoretical upper bound on object and array size is the maximum isize value. This ensures that isize can be used to calculate differences between pointers into an object or array and can address every byte within an object along with one byte past the end.

Even so, I tried using buf.len() < self.buf.capacity() - self.buf.len() to remove the question of addition overflow entirely. But the performance took a hit (this benchmark is super sensitive), and I'd like to avoid that.

There is potential for overflow in write_vectored, though, as it sums many slice lengths together. I'll try to address it.

Update again: disregard the strike-through text. I've learned that there's no guarantee yet that the maximum slice size won't increase beyond isize::MAX in the future. I've changed the code to safeguard against that possibility.

@tgnottingham
Copy link
Contributor Author

I think I've addressed the possibility of overflow in write_vectored. It was a pre-existing issue, and so a bit outside the original scope of this PR. But I'm hoping that's alright, as fixing it is important for the soundness of the optimizations added by this PR.

@rust-log-analyzer

This comment has been minimized.

@tgnottingham
Copy link
Contributor Author

I've been told by @RalfJung that there's no guarantee yet that the maximum slice size won't increase beyond isize::MAX in the future. I'll update this PR to avoid the potential for overflow, even though it's not possible with the current language implementation.

Kogia-sima added a commit to rust-sailfish/sailfish that referenced this pull request Dec 17, 2020
@tgnottingham
Copy link
Contributor Author

Out of curiosity, have you experimented with a smaller hammer by using #[cold] or annotating the branches with likely/unlikely instead of #[inline(never)]?

I don't think I did--definitely not #[cold] anyway. I'll play with them.

Btw, recombining the hot/cold code and annotating the branches wasn't helpful. And using #[inline(never)] versus #[cold] versus no annotation at all didn't change the inlining decisions, at least for the case I'm testing. Still not sure what the best course of action is there.

@the8472
Copy link
Member

the8472 commented Dec 17, 2020

#[cold] feels more appropriate semantically, but other places in the standard library also use #[inline(never)] as an optimization against code bloat so I guess it doesn't really matter.

@mzabaluev
Copy link
Contributor

#[cold] and #[inline] serve different purposes. #[cold] hints the code generator to set branch prediction bias, if supported by the target architecture. I think the _cold functions should have both #[cold] and #[inline(never)].

@mzabaluev
Copy link
Contributor

I think the _cold functions should have both #[cold] and #[inline(never)].

Or perhaps, just #[cold] and let the optimizer decide about inlining. Same applies to #[inline(always)]; this attribute is prone to be misused.

@tgnottingham tgnottingham force-pushed the bufwriter_performance branch from 36844de to 5bfbe41 Compare January 5, 2021 05:13
@tgnottingham
Copy link
Contributor Author

Okay, I've rebased, added a comment about the overflow edge case discussed above, and changed #[inline(never)] and #[inline(always)] to #[cold] and #[inline] respectively.

@tgnottingham tgnottingham force-pushed the bufwriter_performance branch from 5bfbe41 to a84518a Compare January 5, 2021 05:17
@tgnottingham tgnottingham force-pushed the bufwriter_performance branch from 9a5792c to 76094fa Compare March 18, 2021 03:29
@tgnottingham
Copy link
Contributor Author

Rebased on latest master. Will address comments soon.

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 4, 2021
@JohnCSimon
Copy link
Member

@rustbot label: +S-waiting-on-review -S-waiting-on-author

Ensure that `write` and `write_all` can be inlined and that their
commonly executed fast paths can be as short as possible.

`write_vectored` would likely benefit from the same optimization, but I
omitted it because its implementation is more complex, and I don't have
a benchmark on hand to guide its optimization.
We use a Vec as our internal, constant-sized buffer, but the overhead of
using methods like `extend_from_slice` can be enormous, likely because
they don't get inlined, because `Vec` has to repeat bounds checks that
we've already done, and because it makes considerations for things like
reallocating, even though they should never happen.
Optimize for the common case where the input write size is less than the
buffer size. This slightly increases the cost for pathological write
patterns that commonly fill the buffer exactly, but if a client is doing
that frequently, they're already paying the cost of frequent flushing,
etc., so the cost is of this optimization to them is relatively small.
@tgnottingham tgnottingham force-pushed the bufwriter_performance branch from 76094fa to 01e7018 Compare April 14, 2021 04:21
@tgnottingham
Copy link
Contributor Author

This change makes me wonder whether we should try to extend the interface of Vec to support use cases like this where you don't want it to grow (e.g. Vec::try_push or Vec::extend_from_slice_unchecked, etc.). Or if we should change Vec::spare_capacity_mut (which is still unstable) to return some wrapper type that doesn't only give you a &mut [MaybeUninit], but also allows updating len somehow. Or maybe a separate BoundedVec type would make more sense.

What do you think?

In this case, I think it's easy enough to get by with a Box<[MaybeUninit<u8>]. In general, I think a non-growing Vector might be nice to have, but it would have to be judged against whether or not it's worth adding to the Vec API (which is not something I think I'm qualified to evaluate! :)). Seems a bit fringe, but it does come up from time to time.

@crlf0710 crlf0710 added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 1, 2021
Comment on lines +429 to +434
let old_len = self.buf.len();
let buf_len = buf.len();
let src = buf.as_ptr();
let dst = self.buf.as_mut_ptr().add(old_len);
ptr::copy_nonoverlapping(src, dst, buf_len);
self.buf.set_len(old_len + buf_len);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note that you could also implement this with self.buf.spare_capacity_mut() and MaybeUninit::write_slice, which are both still unstable.)

@m-ou-se
Copy link
Member

m-ou-se commented May 5, 2021

Thanks a lot for doing this!

@bors r+

@bors
Copy link
Contributor

bors commented May 5, 2021

📌 Commit 01e7018 has been approved by m-ou-se

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 5, 2021
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request May 6, 2021
@Mark-Simulacrum
Copy link
Member

@bors rollup=never - performance PR

@bors
Copy link
Contributor

bors commented May 6, 2021

⌛ Testing commit 01e7018 with merge 676ee14...

@bors
Copy link
Contributor

bors commented May 6, 2021

☀️ Test successful - checks-actions
Approved by: m-ou-se
Pushing 676ee14 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label May 6, 2021
@bors bors merged commit 676ee14 into rust-lang:master May 6, 2021
@rustbot rustbot added this to the 1.54.0 milestone May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.