Add support for 128-bit integers #10

henninglive · 2017-09-12T21:07:53Z

Adds support for 128-bit integers on the nightly compiler, gated behind an i128 feature flag.

I used the same implementation as for the other types, but it ended up being as slow as std::fmt on my system. I will look into why later, when I have some time.

running 14 tests
test bench_fmt::bench_u128_0    ... bench:          80 ns/iter (+/- 38)
test bench_fmt::bench_u128_max  ... bench:       3,866 ns/iter (+/- 344)
test bench_fmt::bench_u64_max   ... bench:         102 ns/iter (+/- 22)
test bench_itoa::bench_u128_0   ... bench:          20 ns/iter (+/- 9)
test bench_itoa::bench_u128_max ... bench:       3,819 ns/iter (+/- 285)
test bench_itoa::bench_u64_max  ... bench:          39 ns/iter (+/- 7)

Resolves #9

dtolnay · 2017-09-13T03:55:13Z

Wow, I did not expect u128 to be 100 times slower than u64.

I would prefer not to merge this then. The advantage of this crate over std::fmt is that it avoids a small constant amount of overhead from the Formatter machinery. Going from 42ns to 10ns to print an i16 is a big deal, but going from 2530ns to 2500ns won't matter. What do you think?

henninglive · 2017-09-13T04:27:11Z

I agree there is no point in adding this if it’s this slow, but I want to do some profiling before closing this. At first glance, I believe the hotspots are __umodti3() and __udivti3(), which are LLVM’s built-in implementations of modulus and division for u128. They are very slow in comparison to native instructions. I noticed we are always calling them with the same dividend and divisor, so we should really be calling __udivmodti4() to get both the remainder and the quotient in one operation. Changing this might give a substantial speedup.
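To make the point about paired division and modulus concrete, here is a minimal sketch of a digit loop that performs one u128 division per digit and derives the remainder with a multiply and subtract, rather than issuing a separate `%`. The function name `write_u128` and the 39-byte buffer are assumptions for illustration; this is not the code in this pull request, and whether it actually helps depends on how the compiler lowers the division.

```rust
/// Hypothetical helper (not itoa's actual code): writes the decimal digits
/// of `n` into the tail of `buf` and returns them as a string slice.
/// 39 bytes is enough because u128::MAX has 39 decimal digits.
fn write_u128(mut n: u128, buf: &mut [u8; 39]) -> &str {
    let mut pos = buf.len();
    loop {
        // One division per digit ...
        let q = n / 10;
        // ... and the remainder is recovered from the quotient,
        // so no second division (or `%`) is needed.
        let r = (n - q * 10) as u8;
        pos -= 1;
        buf[pos] = b'0' + r;
        n = q;
        if n == 0 {
            break;
        }
    }
    std::str::from_utf8(&buf[pos..]).expect("ASCII digits are valid UTF-8")
}
```

A caller would allocate `let mut buf = [0u8; 39];` on the stack and pass `&mut buf`; the returned slice borrows from that buffer.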

henninglive · 2017-09-15T00:40:15Z

I have done some profiling and optimization. The reason it’s so slow is the __umodti3() and __udivti3() intrinsics. They are very slow compared to the native div instruction. To make matters worse, LLVM currently fails to optimize and combine the two separate calls to __udivti3() and __umodti3() into a single __udivmodti4() call, see rust-lang/rust/#44545. Funnily enough, __udivti3() and __umodti3() are implemented as calls to __udivmodti4(), which means we make two consecutive calls to the same function with the same parameters.

We can optimize this by manually linking to __udivmodti4(), but i128/u128 unfortunately does not have a stable ABI, see rust-lang/rust/#44261. It’s possible, but it would be a hack. Alternatively, we can just copy the function into the crate, which would also have the benefit of allowing inlining, which won’t happen otherwise.

Here is a benchmark showing the effects of these optimizations:
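As a rough illustration of the "copy the function into the crate" option, the sketch below defines a local, inlinable routine that returns the quotient and remainder together. It uses plain shift-and-subtract long division, which is an assumption chosen for brevity; it is not the algorithm from compiler-builtins and is not expected to match its speed. The name `udivmod128` is hypothetical.

```rust
/// Hypothetical local divmod for u128 using binary long division.
/// Being defined in the same crate, it is a candidate for inlining
/// at the call site, unlike an external intrinsic call.
#[inline]
fn udivmod128(n: u128, d: u128) -> (u128, u128) {
    assert!(d != 0, "division by zero");
    let mut q: u128 = 0;
    let mut r: u128 = 0;
    for i in (0..128u32).rev() {
        // Bring down the next bit of the dividend. The shift cannot lose
        // bits: r never exceeds the dividend bits consumed so far.
        r = (r << 1) | ((n >> i) & 1);
        if r >= d {
            r -= d;
            q |= 1u128 << i;
        }
    }
    // Quotient and remainder come out of a single pass.
    (q, r)
}
```

A formatting loop could then call `let (q, r) = udivmod128(n, 10_000);` once per group of four digits instead of issuing `/` and `%` separately.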

Unoptimized:
	test bench_fmt::bench_u128_max  ... bench:       3,046 ns/iter (+/- 183)
	test bench_fmt::bench_u64_max   ... bench:          59 ns/iter (+/- 4)
	test bench_itoa::bench_u128_max ... bench:       3,009 ns/iter (+/- 99)
	test bench_itoa::bench_u64_max  ... bench:          24 ns/iter (+/- 0)

Manual linking:
	test bench_itoa::bench_u128_max ... bench:       1,537 ns/iter (+/- 59)

Inlined:
	test bench_itoa::bench_u128_max ... bench:       1,289 ns/iter (+/- 54)

At this point I think implementing the latter option makes sense; it’s over 100% faster than std::fmt. I have therefore amended the pull request to include this change.

dtolnay · 2017-09-15T00:56:47Z

Nicely done! I filed rust-lang/rust#44583 to provide your faster implementation in std::fmt if possible.

But that gets us back to square one: once std has this implementation, itoa cannot really be any faster. We could provide the fast implementation in itoa until it reaches the stable compiler, then remove it. But maybe a better option would be to provide this in its own crate for now (in addition to a compiler PR).

henninglive · 2017-09-15T01:21:28Z

I don't think inlining is going to happen in std, but once rust-lang/rust/#44545 is fixed, std::fmt will be slightly slower than Manual linking. Inlining will still be 200ns faster than std::fmt. Rustc might be able to automatically inline the builtins at some point, but not in the foreseeable future. If added, this would be much faster for a good while. Also, I think we should have 128-bit support in this crate whenever i128/u128 are stable, just for completeness.

Add support for 128-bit integers

e9069ce

Copy udivmodti4 from compiler-builtins

1d85a0c

Division with remainder on u128 is badly optimized by LLVM. Copying it into our crate allows for inlining and proper optimization.

dtolnay mentioned this pull request Sep 15, 2017

Display impl for u128 and i128 is slow rust-lang/rust#44583

Closed

dtolnay approved these changes Sep 16, 2017

dtolnay merged commit f05d2d6 into dtolnay:master Sep 16, 2017

dtolnay mentioned this pull request Sep 16, 2017

Write u128 using only two divisions #12

Merged

henninglive deleted the i128 branch April 20, 2018 17:57
