-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of ncodeunits(::Char)
#54001
Conversation
This improves performance of `ncodeunits(::Char)` by simply counting the number of non-zero bytes (except for `\0`, which is encoded as all zero bytes). For a performance comparison, see https://gist.github.com/Seelengrab/ebb02d4b8d754700c2869de8daf88cad. The version in this PR is called `nbytesencoded` in the benchmarks. Correctness has been verified with Supposition.jl, using the existing implementation as an oracle: ``` julia> using Supposition julia> const chars = Data.Characters() julia> @check max_examples=1_000_000 function bytesenc(c=chars) ncodeunits(c) == nbytesencoded(c) end; Test Summary: | Pass Total Time bytesenc | 1 1 1.2s julia> ncodeunits('\0') == nbytesencoded('\0') true ``` Notably, neither the existing nor the new implementation check whether the given `Char` is valid or not, since the only thing that matters is how many bytes are written out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure it's faster? On my x86 computer, your code is not faster than the old implementation, even when looping over a text with a varying number of codeunits per character (to defeat the branch predictor). It's possible that it's faster in your benchmark because your benchmark has an unrealistically unpredictable distribution of codeunits per char.
Also, some typos
Also this implementation is incorrect - it fails for some malformed chars like Perhaps this implementation is correct? function ncodeunits(c::Char)
u = reinterpret(UInt32, c)
4 - div(trailing_zeros(u), 8) + iszero(u)
end |
Could you share that benchmark? As far as I know,
Good catch! I'll look into that. |
I took some Greek text (which has a mix of 1- and 2-byte chars), looped over the chars and then summed the codeunits. The idea was that randomly generated chars have an unrealistic distribution of codepoint lengths, and therefore unfairly punishes branchful code since the branch predictor will be much better on real strings with a more predictable number of codeunits per char. |
Indeed it is, and it's faster than all of the masking I've done because of the heavy lifting in If it's alright with you, I'll use your implementation in the PR? The fault in my original implementation was assuming that the leading bytes must all be nonzero, which is of course not true for malformed Results on my machine: julia> @code_native debuginfo=:none nbytesdiv('a')
.text
.file "nbytesdiv"
.globl julia_nbytesdiv_15365 # -- Begin function julia_nbytesdiv_15365
.p2align 4, 0x90
.type julia_nbytesdiv_15365,@function
julia_nbytesdiv_15365: # @julia_nbytesdiv_15365
; Function Signature: nbytesdiv(Char)
# %bb.0: # %top
#DEBUG_VALUE: nbytesdiv:c <- $edi
push rbp
tzcnt ecx, edi
xor eax, eax
mov rbp, rsp
shr ecx, 3
cmp edi, 1
adc rax, 4
sub rax, rcx
pop rbp
ret
.Lfunc_end0:
.size julia_nbytesdiv_15365, .Lfunc_end0-julia_nbytesdiv_15365
# -- End function
.section ".note.GNU-stack","",@progbits
julia> @benchmark nbytesdiv(c) setup=(c=rand(Char))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.689 ns … 12.760 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.720 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.729 ns ± 0.169 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁
▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂▇▁▁▁▁▁▁▂█▁▁▁▁▁▁▂█▁▁▁▁▁▁▂▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃ ▂
1.69 ns Histogram: frequency by time 1.76 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sum(nbytesdiv, c) setup=(c=rand(Char, 10_000))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.115 μs … 2.093 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.152 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.156 μs ± 38.131 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▄▅▅▆▇██▆▆▅▃▁▁ ▁ ▂▂▂ ▂
▄▅▇▇▆▅▄▁▃▁▄▅███████████████████████▇██▇▃▁▄▄▄▃▅▅▆▅▃▃▃▃▁▃▅▅▇ █
1.12 μs Histogram: log(frequency) by time 1.22 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @check max_examples=1_000_000 function bytesenc(i=Data.Integers{UInt32}())
c = reinterpret(Char, i)
ncodeunits(c) == nbytesdiv(c)
end;
Test Summary: | Pass Total Time
bytesenc | 1 1 1.0s |
This came up in the context of JuliaLang/julia#54001, where doing the correct thing for malformed `Char` slipped through. Thanks, @jakobnissen!
Yep, you can use that implementation 👍 |
The previous implementation assumed that all `Char` are well-formed, which is of course not guaranteed to be the case (and which is also correctly handled by the existing implementation). On top of that, this is even faster, since counting the number of trailing zeros has hardware support on a wide range of architectures. Implementation based on a suggestion by @jakobnissen. Thanks!
9c9b6f6
to
19b29df
Compare
Style nit: I'm not sure the blank lines do anything for readability, how about just: function ncodeunits(c::Char)
u = reinterpret(UInt32, c)
# We care about how many trailing bytes are all zero
n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
# Take care of '\0', which has an all-zero bitpattern
n_nonzero_bytes + iszero(u)
end |
Failures seem to be due to Github/CI hiccups, unrelated to this PR.
Sure, that's reasonable. I've also added a clarification that the line doesn't just compute the number of trailing nonzero bytes. |
Failure seems to be the Profile thing again - see #53984. Otherwise IMO good to merge. |
This came up in the context of JuliaLang/julia#54001, where doing the correct thing for malformed `Char` slipped through. Thanks, @jakobnissen!
This improves performance of
ncodeunits(::Char)
by simply counting the number of non-zero bytes (except for\0
, which is encoded as all zero bytes). For a performance comparison, see this gist; there's an up to 10x improvement here for collections ofChar
, with a minor improvement for singleChar
(with much smaller spread). The version in this PR is callednbytesencoded
in the benchmarks.Correctness has been verified with Supposition.jl, using the existing implementation as an oracle:
Let's see if CI agrees!
Notably, neither the existing nor the new implementation check whether the given
Char
is valid or not, since the only thing that matters is how many bytes are written out.