Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for 128-bit integers #10

Merged
merged 2 commits into from
Sep 16, 2017
Merged

Add support for 128-bit integers #10

merged 2 commits into from
Sep 16, 2017

Conversation

henninglive
Copy link
Contributor

@henninglive henninglive commented Sep 12, 2017

Adds support for 128-bit integers on the nightly compiler feature gated behind an i128 feature flag.

I used the same implementation as used by the other types, but ended up being as slow as std::fmt on my system. I look into why later, when I have some time.

running 14 tests
test bench_fmt::bench_u128_0    ... bench:          80 ns/iter (+/- 38)
test bench_fmt::bench_u128_max  ... bench:       3,866 ns/iter (+/- 344)
test bench_fmt::bench_u64_max   ... bench:         102 ns/iter (+/- 22)
test bench_itoa::bench_u128_0   ... bench:          20 ns/iter (+/- 9)
test bench_itoa::bench_u128_max ... bench:       3,819 ns/iter (+/- 285)
test bench_itoa::bench_u64_max  ... bench:          39 ns/iter (+/- 7)

Resolves #9

@dtolnay
Copy link
Owner

dtolnay commented Sep 13, 2017

Wow I did not expect u128 to be 100 times slower than u64.

I would prefer not to merge this then. The advantage of this crate over std::fmt is that it avoids a small constant amount of overhead from the Formatter machinery. Going from 42ns to 10ns to print an i16 is a big deal, but going from 2530ns to 2500ns won't matter. what do you think?

@henninglive
Copy link
Contributor Author

I agree there no point in adding this if it’s this slow, but I want to do some profiling before closing this. At first glace, I believe the hotspots are __umodti3() and __udivti3() which is LLVM’s built-in implementation of divide and modulus for u128. They are very slow in compression to instructions. I noticed we are always calling them with same dividend and divisor so we should really be calling __udivmodti4() to get both the reminder and quotient in one operation, changing this might give substantial speedup.

Division with remainder on u128 is badly optimized by LLVM. Copying it
into our crate allows for inlining and proper optimization.
@henninglive
Copy link
Contributor Author

henninglive commented Sep 15, 2017

I have done some profiling and optimization. The reason it’s so slow are the __umodti3() and __udivti3() intrinsics. They are very slow compared the native div instruction. To make matters worse, LLVM currently fails to optimize and combine the two separate calls to __udivti3() and __umodti3() into a __udivmodti4() call, see rust-lang/rust/#44545. Funnily enough, __udivti3() and __umodti3() are implemented as calls to __udivmodti4(), which means we make two consecutively calls to same function with the same parameters.

We can optimize this my manually linking to __udivmodti4(), but i128/u128 does unfortunately not have a stable ABI, see rust-lang/rust/#44261. It’s possible, but it would be a hack. Alternatively we can just copy the function into the crate, this would also have the benefit allowing inlining, which won’t happen otherwise.

Here is benchmark showing the effects of these optimizations:

Unoptimized:
	test bench_fmt::bench_u128_max  ... bench:       3,046 ns/iter (+/- 183)
	test bench_fmt::bench_u64_max   ... bench:          59 ns/iter (+/- 4)
	test bench_itoa::bench_u128_max ... bench:       3,009 ns/iter (+/- 99)
	test bench_itoa::bench_u64_max  ... bench:          24 ns/iter (+/- 0)

Manual linking:
	test bench_itoa::bench_u128_max ... bench:       1,537 ns/iter (+/- 59)

Inlined:
	test bench_itoa::bench_u128_max ... bench:       1,289 ns/iter (+/- 54)

At this point I think implementing the latter option makes sense, it’s over 100% faster then std::fmt. I have therefore amended the pull request to include this change.

@dtolnay
Copy link
Owner

dtolnay commented Sep 15, 2017

Nicely done! I filed rust-lang/rust#44583 to provide your faster implementation in std::fmt if possible.

But that gets us back to the previous square -- once std has this implementation, itoa cannot really be any faster. We could provide the fast implementation in itoa until it reaches the stable compiler, then remove it. But maybe a better option would be to provide this in its own crate for now (in addition to a compiler PR).

@henninglive
Copy link
Contributor Author

henninglive commented Sep 15, 2017

I don't think inlining is going to happen in std, but once rust-lang/rust/#44545 is fixed std::fmt will be slightly slower then Manual linking. Inlining will still be 200ns faster than std::fmt. Rustc might be able to automatically inline the builtins in at some point, but not in the foreseeable future. If added this would be mutch faster for good while. Also, I think we should have 128-bit support in this crate whenever i128/u128 are stable, just for completeness.

@dtolnay dtolnay merged commit f05d2d6 into dtolnay:master Sep 16, 2017
@henninglive henninglive deleted the i128 branch April 20, 2018 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants