Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of ncodeunits(::Char) #54001

Merged
merged 3 commits into from
Apr 9, 2024

Conversation

Seelengrab
Copy link
Contributor

@Seelengrab Seelengrab commented Apr 9, 2024

This improves performance of ncodeunits(::Char) by simply counting the number of non-zero bytes (except for \0, which is encoded as all zero bytes). For a performance comparison, see this gist; there's an up to 10x improvement here for collections of Char, with a minor improvement for single Char (with much smaller spread). The version in this PR is called nbytesencoded in the benchmarks.

Correctness has been verified with Supposition.jl, using the existing implementation as an oracle:

julia> using Supposition

julia> const chars = Data.Characters()

julia> @check max_examples=1_000_000 function bytesenc(i=Data.Integers{UInt32}())
           c = reinterpret(Char, i)
           ncodeunits(c) == nbytesdiv(c)
       end;
Test Summary: | Pass  Total  Time
bytesenc      |    1      1  1.0s

julia> ncodeunits('\0') == nbytesencoded('\0')
true

Let's see if CI agrees!

Notably, neither the existing nor the new implementation check whether the given Char is valid or not, since the only thing that matters is how many bytes are written out.

This improves performance of `ncodeunits(::Char)` by simply counting
the number of non-zero bytes (except for `\0`, which is encoded
as all zero bytes). For a performance comparison, see
https://gist.github.com/Seelengrab/ebb02d4b8d754700c2869de8daf88cad.
The version in this PR is called `nbytesencoded` in the benchmarks.

Correctness has been verified with Supposition.jl, using the existing
implementation as an oracle:

```
julia> using Supposition

julia> const chars = Data.Characters()

julia> @check max_examples=1_000_000 function bytesenc(c=chars)
           ncodeunits(c) == nbytesencoded(c)
       end;
Test Summary: | Pass  Total  Time
bytesenc      |    1      1  1.2s

julia> ncodeunits('\0') == nbytesencoded('\0')
true
```

Notably, neither the existing nor the new implementation check whether
the given `Char` is valid or not, since the only thing that matters
is how many bytes are written out.
@Seelengrab Seelengrab added the performance Must go faster label Apr 9, 2024
Copy link
Contributor

@jakobnissen jakobnissen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it's faster? On my x86 computer, your code is not faster than the old implementation, even when looping over a text with a varying number of codeunits per character (to defeat the branch predictor). It's possible that it's faster in your benchmark because your benchmark has an unrealistically unpredictable distribution of codeunits per char.

Also, some typos

base/char.jl Outdated Show resolved Hide resolved
base/char.jl Outdated Show resolved Hide resolved
@jakobnissen
Copy link
Contributor

jakobnissen commented Apr 9, 2024

Also this implementation is incorrect - it fails for some malformed chars like reinterpret(Char, 0xff0000ff).

Perhaps this implementation is correct?

function ncodeunits(c::Char)
    u = reinterpret(UInt32, c)
    4 - div(trailing_zeros(u), 8) + iszero(u)
end

@Seelengrab
Copy link
Contributor Author

Seelengrab commented Apr 9, 2024

Are you sure it's faster? On my x86 computer, your code is not faster than the old implementation, even when looping over a text with a varying number of codeunits per character (to defeat the branch predictor).

Could you share that benchmark? As far as I know, rand(Char) should produce a good sampling of inputs here.

Also this implementation is incorrect - it fails for some malformed chars like reinterpret(Char, 0xff0000ff).

Good catch! I'll look into that.

@jakobnissen
Copy link
Contributor

I took some Greek text (which has a mix of 1- and 2-byte chars), looped over the chars and then summed the codeunits. The idea was that randomly generated chars have an unrealistic distribution of codepoint lengths, and therefore unfairly punishes branchful code since the branch predictor will be much better on real strings with a more predictable number of codeunits per char.
However, thinking more about it, it's probably not a big deal. Performance is on par and your code relies on less magical compiler optimisation so it's better. By proposed one using trailing_zeros is even faster.

@Seelengrab
Copy link
Contributor Author

Seelengrab commented Apr 9, 2024

Perhaps this implementation is correct?

Indeed it is, and it's faster than all of the masking I've done because of the heavy lifting in trailing_zeros being done by the CPU. This ends up as a tzcnt, which is unlikely to be beat. The division afterwards is also efficient, since it ends up as a shr ecx, 3, i.e. x >> 0x3; everything here is in integers after all.

If it's alright with you, I'll use your implementation in the PR? The fault in my original implementation was assuming that the leading bytes must all be nonzero, which is of course not true for malformed Char. The original @check didn't catch this because Data.Characters only produces well-formed Char, and no malformed ones. Seems like a missing feature switch on my end.

Results on my machine:

julia> @code_native debuginfo=:none nbytesdiv('a')
	.text
	.file	"nbytesdiv"
	.globl	julia_nbytesdiv_15365           # -- Begin function julia_nbytesdiv_15365
	.p2align	4, 0x90
	.type	julia_nbytesdiv_15365,@function
julia_nbytesdiv_15365:                  # @julia_nbytesdiv_15365
; Function Signature: nbytesdiv(Char)
# %bb.0:                                # %top
	#DEBUG_VALUE: nbytesdiv:c <- $edi
	push	rbp
	tzcnt	ecx, edi
	xor	eax, eax
	mov	rbp, rsp
	shr	ecx, 3
	cmp	edi, 1
	adc	rax, 4
	sub	rax, rcx
	pop	rbp
	ret
.Lfunc_end0:
	.size	julia_nbytesdiv_15365, .Lfunc_end0-julia_nbytesdiv_15365
                                        # -- End function
	.section	".note.GNU-stack","",@progbits

julia> @benchmark nbytesdiv(c) setup=(c=rand(Char))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  1.689 ns  12.760 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.720 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.729 ns ±  0.169 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                           █       ▁                          
  ▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂▇▁▁▁▁▁▁▂█▁▁▁▁▁▁▂█▁▁▁▁▁▁▂▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃ ▂
  1.69 ns        Histogram: frequency by time        1.76 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(nbytesdiv, c) setup=(c=rand(Char, 10_000))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.115 μs   2.093 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.152 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.156 μs ± 38.131 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▃▄▅▅▆▇██▆▆▅▃▁▁  ▁ ▂▂▂                        ▂
  ▄▅▇▇▆▅▄▁▃▁▄▅███████████████████████▇██▇▃▁▄▄▄▃▅▅▆▅▃▃▃▃▁▃▅▅▇ █
  1.12 μs      Histogram: log(frequency) by time     1.22 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @check max_examples=1_000_000 function bytesenc(i=Data.Integers{UInt32}())
           c = reinterpret(Char, i)
           ncodeunits(c) == nbytesdiv(c)
       end;
Test Summary: | Pass  Total  Time
bytesenc      |    1      1  1.0s

Seelengrab added a commit to Seelengrab/Supposition.jl that referenced this pull request Apr 9, 2024
This came up in the context of JuliaLang/julia#54001,
where doing the correct thing for malformed `Char` slipped through.

Thanks, @jakobnissen!
@tecosaur tecosaur added the strings "Strings!" label Apr 9, 2024
@jakobnissen
Copy link
Contributor

Yep, you can use that implementation 👍

The previous implementation assumed that all `Char` are well-formed,
which is of course not guaranteed to be the case (and which is also
correctly handled by the existing implementation). On top of that,
this is even faster, since counting the number of trailing zeros
has hardware support on a wide range of architectures.

Implementation based on a suggestion by @jakobnissen. Thanks!
@Seelengrab Seelengrab force-pushed the ncodeunits_speedup branch from 9c9b6f6 to 19b29df Compare April 9, 2024 11:20
@tecosaur
Copy link
Contributor

tecosaur commented Apr 9, 2024

Style nit: I'm not sure the blank lines do anything for readability, how about just:

function ncodeunits(c::Char)
    u = reinterpret(UInt32, c)
    # We care about how many trailing bytes are all zero
    n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
    # Take care of '\0', which has an all-zero bitpattern
    n_nonzero_bytes + iszero(u)
end

@Seelengrab
Copy link
Contributor Author

Seelengrab commented Apr 9, 2024

Failures seem to be due to Github/CI hiccups, unrelated to this PR.

Style nit: I'm not sure the blank lines do anything for readability, how about just:

Sure, that's reasonable. I've also added a clarification that the line doesn't just compute the number of trailing nonzero bytes.

@Seelengrab
Copy link
Contributor Author

Seelengrab commented Apr 9, 2024

Failure seems to be the Profile thing again - see #53984. Otherwise IMO good to merge.

@oscardssmith oscardssmith merged commit d183ee1 into JuliaLang:master Apr 9, 2024
5 of 7 checks passed
Seelengrab added a commit to Seelengrab/Supposition.jl that referenced this pull request May 3, 2024
This came up in the context of JuliaLang/julia#54001,
where doing the correct thing for malformed `Char` slipped through.

Thanks, @jakobnissen!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants