-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: ~15% Faster str to int parsing (when base 10) #83088
Conversation
… here. Rather than multiplying the same number each time, if we're multiplying different numbers then the ALU can reorder them (as they have fewer data dependencies) and do more of them in parallel. If radix is called with a constant (which it is 99% of the time) I'm pretty sure that llvm evaluates the if at compile time as to which algorythm to pick. I noticed that when radix == 16 it can figure out that it can shift rather than multiply which is pretty impressive in my book that it can spot that! Speedup wise I'm seeing it go from 1.4ns to 1.2ns. Not earth shattering, but not to be sniffed at either.
(rust-highfive has picked a reviewer for you, use r? to override) |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
If it's too specific being constrained to u32 sized and below for the std lib I understand. I can always do this as a separate crate. |
This comes from a talk from Andrei Alexandrescu, fastware (timestamp important) In the past I played around with it, and implemented this parsing technique for each integer type with macros as seen here. and even made a const fn version of it: https://gist.github.com/DutchGhost/d8604a3c796479777fe9f5e25d855cfd I dont remember my bench results exactly, but the speed improvement was not all that much, but maybe that was just my implementation. It's a neat parsing technique :) |
Thank you! I couldn’t refind the talk anywhere. It’s a great video to
watch- eye opening!
…On Mon, 15 Mar 2021 at 08:47, DutchGhost ***@***.***> wrote:
This comes from a talk from Andrei Alexandrescu:
https://youtu.be/o4-CwDo2zpg?t=2346 (timestamp important)
In the past I played around with it, and implemented this parsing
technique for each integer type with macros:
https://github.com/DutchGhost/byte-num/blob/a72222fd7927d3bf5d203389c360ccb1b874ca41/src/from_ascii.rs#L72,
and even made a const fn version of it:
https://gist.github.com/DutchGhost/d8604a3c796479777fe9f5e25d855cfd
I dont remember my bench results exactly, but the speed improvement was
not all that much, but maybe that was just my implementation. It's a neat
parsing technique :)
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#83088 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGEJCEB4HUIAQKJPW263JDTDXCTPANCNFSM4ZEAKXMQ>
.
|
This seems like a reasonable optimization, if it ends up providing an improvement, and passes all tests. Could you please address the test failures, rename the multiplier array to something like |
The job Click to see the possible cause of the failure (guessed by this bot)
|
I think the other PR has more effect and is broader in scope than this PR. |
if is_positive { | ||
// The number is positive | ||
for &c in digits { | ||
let x = match (c as char).to_digit(radix) { | ||
Some(x) => x, | ||
None => return Err(PIE { kind: InvalidDigit }), | ||
}; | ||
let x = match factors[idx].checked_mul(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: PosOverflow }), | ||
}; | ||
result = match result.checked_add(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: PosOverflow }), | ||
}; | ||
idx += 1; | ||
} | ||
} else { | ||
// The number is negative | ||
for &c in digits { | ||
let x = match (c as char).to_digit(radix) { | ||
Some(x) => x, | ||
None => return Err(PIE { kind: InvalidDigit }), | ||
}; | ||
let x = match factors[idx].checked_mul(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: NegOverflow }), | ||
}; | ||
result = match result.checked_sub(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: NegOverflow }), | ||
}; | ||
idx += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to have so many duplicated code here? Also, why not put the extra condition branch in the error flow rather the default flow?
Maybe we can use iterator for idx here?
if is_positive { | |
// The number is positive | |
for &c in digits { | |
let x = match (c as char).to_digit(radix) { | |
Some(x) => x, | |
None => return Err(PIE { kind: InvalidDigit }), | |
}; | |
let x = match factors[idx].checked_mul(x) { | |
Some(result) => result, | |
None => return Err(PIE { kind: PosOverflow }), | |
}; | |
result = match result.checked_add(x) { | |
Some(result) => result, | |
None => return Err(PIE { kind: PosOverflow }), | |
}; | |
idx += 1; | |
} | |
} else { | |
// The number is negative | |
for &c in digits { | |
let x = match (c as char).to_digit(radix) { | |
Some(x) => x, | |
None => return Err(PIE { kind: InvalidDigit }), | |
}; | |
let x = match factors[idx].checked_mul(x) { | |
Some(result) => result, | |
None => return Err(PIE { kind: NegOverflow }), | |
}; | |
result = match result.checked_sub(x) { | |
Some(result) => result, | |
None => return Err(PIE { kind: NegOverflow }), | |
}; | |
idx += 1; | |
} | |
for &c in digits { | |
let x = match (c as char).to_digit(radix) { | |
Some(x) => x, | |
None => return Err(PIE { kind: InvalidDigit }), | |
}; | |
let x = match factors[idx].checked_mul(x) { | |
Some(result) => result, | |
None => return Err(PIE { kind: if is_positive { PosOverflow } else { NegOverflow } }), | |
}; | |
result = match result.checked_add(x) { | |
Some(result) => result, | |
None => return Err(PIE { kind: if is_positive { PosOverflow } else { NegOverflow } }), | |
}; | |
idx += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you missed there was a checked_sub
in there? Having an if in the loop to choose between checked_add and checked_sub kills performance - I can attest to that! I did wonder if we could do better too. You can't add and then negate as the negative ranges are slightly bigger. So you could always subtract and then !+1 to make positive, which would work great for signed types (with a check to see if val == T::Min) but I couldn't get my head around what would happen if I was doing subtractions to unsigned types - if I have to do unsigned types a different way then I've not made anything any simpler. So I can't think of a way to simplify the code, but am willing to try anything if you have any other ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I did missed that.
But what if we put the if
branch only for checked_add
and checked_sub
but keep the to_digit
and checked_mul
?
Another idea, can you also try using if it's negative after the for loop to do checked_neg
there? Remove the for loop and add a branch to convert the positive number to negative using checked_neg
and return NegOverflow
if there are issues so we don't need another loop for negative, a single branch.
While trying that, maybe can also try a hand crafted negation instead of checked_add
? Like this
for &c in digits {
... // checked_add ...
}
if !is_positive {
const UNSIGNED_BITS: i8 = (i8::MAX as u8 * 2) as i8 + 1; // don't just use i8, can use whatever type it is
result = (i ^ UNSIGNED_BITS).checked_add(1);
}
One more thing, I don't think we need the whole bunch of negative calculation in unsigned code do we?
If would be better if you have a repo to try out stuff on the benchmarking, like the one I did https://github.com/pickfire/parseint so that I can also try to help out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For i8
if we're parsing "-128"
we run into representation trouble if we start positive and then negate, that's why one would want to do subtractions and then neg. But we'd need a different code path for unsigned...
I don't think we need the whole bunch of negative calculation in unsigned code do we?
Hmm, we could do if is_positive || !is_signed_ty {
rather than if is_positive
- that way the compiler should remove the else block for unsigned types (I don't think currently the compiler can realise that is_positive is always true for unsigned types, but it's pretty crafty so it might have figured that out anyhows).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fork your parseint and add some benchmarks. (The trick
is very cool you've got there btw.)
I need to double check my benchmarks anyhow because I reaslised that my blackbox was not quite in the right place and the compiler was optimising for the data (rookie mistake!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you link your godbolt experiments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in the godbolt is an exploration of expanding to larger type sizes than u32 but it's slow due to an if in the loop. Nevertheless it demostrates the is_signed_ty
evaluation nicely.
Sure, on the left it's a const that evaluates to false and on the right it's an expression. https://godbolt.org/z/od3ajTh3v
Interestingly on the left side if I replace
const is_signed_ty: bool = false;
with
let is_signed_ty: bool = false;
it can't tell that it should drop the else expression which is surprising. I would have thought it should have proporgated the constant into the is_signed_ty given it was assigned by a constant and is not mutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That godbolt link shows unoptimized assembly. You forgot to set -Copt-level=3
. With that the assembly is identical between the two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or -O
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pickfire I took a stab at building in all the safeguards to your trick as an alternative route for radix 10. Have submitted a PR for us to talk around: pickfire/parseint#2
let x = match (c as char).to_digit(radix) { | ||
Some(x) => x, | ||
None => return Err(PIE { kind: InvalidDigit }), | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These match
es can all be written a bit shorter like this:
let x = match (c as char).to_digit(radix) { | |
Some(x) => x, | |
None => return Err(PIE { kind: InvalidDigit }), | |
}; | |
let x = (c as char).to_digit(radix).ok_or(PIE { kind: InvalidDigit })?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall we used to do these but in performance sensitive parts using ok_or
is slower than match
counterparts, from what I recall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? The optimizer should have all information it needs and in my quick tests the generated assembly is identical. Is there an open issue or do you have a test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's what I recall but not sure if the same goes for this case, if the assembly is the same then I guess it's good to have it.
Some(result) => result, | ||
None => return Err(PIE { kind: PosOverflow }), | ||
}; | ||
idx += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to manage an index manually if you loop over the factors directly:
for (&c, &factor) in digits.zip(factors) { ... }
Assuming zip doesn't inhibit any optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both zip
and enumerate
inhibited optimisations - I like that euphamism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen some weird interactions between zip
and rev
, which currently(?) cause suboptimal assembly, but never with enumerate
. Is this another case of missing -Copt-level=3
on godbolt?
PS: No euphemism intended.
if is_positive { | ||
// The number is positive | ||
for &c in digits { | ||
let x = match (c as char).to_digit(radix) { | ||
Some(x) => x, | ||
None => return Err(PIE { kind: InvalidDigit }), | ||
}; | ||
let x = match factors[idx].checked_mul(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: PosOverflow }), | ||
}; | ||
result = match result.checked_add(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: PosOverflow }), | ||
}; | ||
idx += 1; | ||
} | ||
} else { | ||
// The number is negative | ||
for &c in digits { | ||
let x = match (c as char).to_digit(radix) { | ||
Some(x) => x, | ||
None => return Err(PIE { kind: InvalidDigit }), | ||
}; | ||
let x = match factors[idx].checked_mul(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: NegOverflow }), | ||
}; | ||
result = match result.checked_sub(x) { | ||
Some(result) => result, | ||
None => return Err(PIE { kind: NegOverflow }), | ||
}; | ||
idx += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could move that for loop into a closure or macro or something that gets PosOverflow
/NegOverflow
and checked_add
/checked_sub
as parameters.
IIRC @dtolnay has done some work around this in the ecosystem, do you think you'd be interested in reviewing this PR? Otherwise I'll try to take a look soon. |
I am thinking of closing down this PR in favour of #83371 I am not sure that the technique in this PR has benefits scailed up to u64. I am looking to see if I can safely weaponise pickire's tricks at radix 10 parsing, but for now I think that should live in a separate atoi_radix10 crate where it can get proved out. |
Try not to publish a crate (so we don't occupy a name, like what I did to ve), also can just ping me for review, I would be happy to do that. |
I have done float parsing, and I have done int and float printing, but so far not int parsing. Would love if someone else were able to spearhead the review on this one. |
I have done porting a verify specific version of int parsing from c++. I believe @gilescope is currently experimenting to get a faster parser by having another repository to test his ideas out. @gilescope you may want to change this pull request to draft in the meantime. I will also take a look at it after that since we could have a faster way of testing out the new changes. |
Good idea
…On Wed, 14 Apr 2021 at 17:06, Ivan Tham ***@***.***> wrote:
I have done porting a verify specific version of int parsing from c++. I
believe @gilescope <https://github.com/gilescope> is currently
experimenting to get a faster parser by having another repository to test
his ideas out. @gilescope <https://github.com/gilescope> you may want to
change this pull request to draft in the meantime.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#83088 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGEJCALZWXIDRNTO3FFPZTTIW4SDANCNFSM4ZEAKXMQ>
.
|
I'm going to close this for now to reduce confusion. I am working on base10, but I don't know when I'm going to be done done (there always seems to be some new idea to try). It's also a way off being generic so not that suitable for inclusion into std lib yet (if ever as there's quite a bit more code) but I do want to answer the question of what's the fastest way we can parse ints. Meanwhile I think we should move forward with #83371 - there's no downside except a little more code and it speeds up all radix. |
I did finally get to a generic way of parsing base 10 strings fast and published it as |
I saw a c++ talk on numbers and wondered if the same trick would work here.
Rather than multiplying the same number each time, if we're multiplying different numbers then the ALU can reorder them (as they have fewer data dependencies and add is commutative) and do more of them in parallel.
If radix is called with a constant (which it is 99% of the time) I'm pretty sure that llvm evaluates the
if radix == 10
at compile time - I noticed that whenradix == 16
llvm can figure out that it can shift rather than multiply! Pretty impressive in my book that it can spot that!Speedup wise I'm seeing it go from 1.4ns to 1.2ns. Not earth shattering, but not to be sniffed at either.