Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: eachsplit for iterative splitting #39245

Merged
merged 2 commits into from
Sep 8, 2021

Conversation

anaveragehuman
Copy link
Contributor

@anaveragehuman anaveragehuman commented Jan 14, 2021

Fixes #20603, replacing/closes #20688, closes #7027.

We might also want an eachrsplit to match rsplit.

Initial benchmarks show this being more expensive than the existing split. Advice on how to improve this would be appreciated.

Existing implementation:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     267.648 ns (0.00% GC)
  median time:      281.473 ns (0.00% GC)
  mean time:        306.094 ns (2.61% GC)
  maximum time:     7.298 μs (95.83% GC)
  --------------
  samples:          10000
  evals/sample:     315

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  876.213 ns (4 allocations: 800 bytes)
36

This PR:

@benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  384 bytes
  allocs estimate:  5
  --------------
  minimum time:     285.511 ns (0.00% GC)
  median time:      301.029 ns (0.00% GC)
  mean time:        333.977 ns (5.43% GC)
  maximum time:     10.054 μs (95.45% GC)
  --------------
  samples:          10000
  evals/sample:     276

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  1.116 μs (13 allocations: 1.34 KiB)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  813.111 ns (9 allocations: 576 bytes)
36

base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with minor comments

@anaveragehuman
Copy link
Contributor Author

Thanks for reviewing! I'm able to achieve 0 allocations with iteration using a loop, but there's still a noticeable overhead compared to the previous split implementation, some of which seems to be added by Iterators.filter when keepempty=false.

I've added another commit to use eachsplit instead of split where possible throughout Base so those functions can benefit from the speedup and smaller memory footprint.

Latest master:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     236.418 ns (0.00% GC)
  median time:      240.995 ns (0.00% GC)
  mean time:        257.252 ns (2.55% GC)
  maximum time:     4.492 μs (94.14% GC)
  --------------
  samples:          10000
  evals/sample:     421

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  878.023 ns (4 allocations: 800 bytes)
36

This PR:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     255.710 ns (0.00% GC)
  median time:      269.169 ns (0.00% GC)
  mean time:        283.707 ns (2.81% GC)
  maximum time:     6.288 μs (94.93% GC)
  --------------
  samples:          10000
  evals/sample:     355

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  1.052 μs (13 allocations: 1.34 KiB)
36

julia> @btime sum(length, split("The quick brown fox jumps over the lazy dog."; keepempty=true))
  967.529 ns (4 allocations: 800 bytes)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  752.438 ns (0 allocations: 0 bytes)
36

base/strings/util.jl Outdated Show resolved Hide resolved
@anaveragehuman
Copy link
Contributor Author

I couldn't figure out why Iterators.filter allocates so I moved the filtering into the function. Looks like performance is pretty close to the current implementation now.

Current master:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     268.547 ns (0.00% GC)
  median time:      275.657 ns (0.00% GC)
  mean time:        291.221 ns (2.48% GC)
  maximum time:     6.723 μs (95.40% GC)
  --------------
  samples:          10000
  evals/sample:     318

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  908.389 ns (4 allocations: 800 bytes)
36

This PR:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     243.090 ns (0.00% GC)
  median time:      247.575 ns (0.00% GC)
  mean time:        263.489 ns (2.89% GC)
  maximum time:     5.694 μs (94.25% GC)
  --------------
  samples:          10000
  evals/sample:     398

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  887.864 ns (4 allocations: 800 bytes)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  627.118 ns (0 allocations: 0 bytes)
36

@fredrikekre fredrikekre added strings "Strings!" needs news A NEWS entry is required for this change labels Apr 28, 2021
@cmcaine
Copy link
Contributor

cmcaine commented Sep 6, 2021

The "needs news" label is inaccurate now.

This looks good to go? Maybe triage can make a decision on it and report here?

This commit moves the existing splitting implementation into an iterator
named `eachsplit` and changes the definition of `split(...)` to
`collect(eachsplit(...))`, plus a few edge cases.
@anaveragehuman
Copy link
Contributor Author

Rebased again. I'm not completely happy about the new split being a little slower, but I suppose the takeaway is to use the iterating version if performance is important.

Current master:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial: 10000 samples with 405 evaluations.
 Range (min  max):  240.978 ns    5.592 μs  ┊ GC (min  max): 0.00%  93.55%
 Time  (median):     264.570 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   292.964 ns ± 199.246 ns  ┊ GC (mean ± σ):  3.05% ±  4.45%

   ▂▄██▇▆▃▃▂▂▂▃▂▂▁                                              ▂
  ▇████████████████▇▇▇▆▆▆▆▆▅▅▆▅▅▄▅▆▅▆▆▆▆▆▇▇▆▇▆▇▇▆▇▆▇▇▇▇▇▆▆▆▆▆▆▇ █
  241 ns        Histogram: log(frequency) by time        554 ns <

 Memory estimate: 272 bytes, allocs estimate: 2.

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  958.864 ns (3 allocations: 1.25 KiB)
36

This PR:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial: 10000 samples with 340 evaluations.
 Range (min  max):  258.824 ns    8.959 μs  ┊ GC (min  max): 0.00%  95.93%
 Time  (median):     289.379 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   316.502 ns ± 276.307 ns  ┊ GC (mean ± σ):  3.61% ±  4.03%

  ▁ ▄▃▇▂█▃▃▃▃▂▂▃▂▂▂                                             ▁
  █▆██████████████████▇▆▇▇▇▆▆▄▅▆▆▅▆▆▆▅▅▅▆▆▆▆▇▆▆▇▇▆▆▆▆▇▆▇▇▆▇▆▇▆▆ █
  259 ns        Histogram: log(frequency) by time        557 ns <

 Memory estimate: 272 bytes, allocs estimate: 2.

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  981.739 ns (3 allocations: 1.25 KiB)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  721.389 ns (0 allocations: 0 bytes)
36

@Moelf
Copy link
Contributor

Moelf commented Sep 6, 2021

also needs test?

@cmcaine
Copy link
Contributor

cmcaine commented Sep 8, 2021

This PR implements split with eachsplit and uses eachsplit in a few other places in Base, so it's kind of already covered by the existing tests. Not sure it needs any more?

@vtjnash vtjnash merged commit dba8a08 into JuliaLang:master Sep 8, 2021
@anaveragehuman anaveragehuman deleted the eachsplit branch September 9, 2021 02:50
@sostock
Copy link
Contributor

sostock commented Sep 9, 2021

Would it make sense to put this into Base.Iterators (as Iterators.split), since Base.Iterators defines lazy versions Base functions (cf. map/Iterators.map or filter/Iterators.filter)?

Also, it would be good to add a compat admonition to the docstring (“This function requires at least Julia 1.8”).

aviatesk added a commit that referenced this pull request Sep 10, 2021
Somehow type constraints from the complex `while` condition don't
propagate to the `while` body.
@stevengj
Copy link
Member

stevengj commented Sep 13, 2021

Might be nice to have reverse-iteration support, i.e. to implement iterate(r::Iterators.Reverse{SplitIterator}, state...).

(That would also give you e.g. last(eachsplit(...), n).)

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
This moves the existing splitting implementation into an iterator
named `eachsplit` and changes the definition of `split(...)` to
`collect(eachsplit(...))`, plus a few edge cases.
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
This moves the existing splitting implementation into an iterator
named `eachsplit` and changes the definition of `split(...)` to
`collect(eachsplit(...))`, plus a few edge cases.
@jariji
Copy link
Contributor

jariji commented Sep 16, 2022

Inspired by a question on Zulip, I'll ask, why is this function called eachsplit rather than Iterators.split? What is "each split" supposed to mean?

@Moelf
Copy link
Contributor

Moelf commented Sep 16, 2022

because there are a few each*s

julia> each

eachcol    eachindex  eachline   eachmatch  eachrow    eachslice  eachsplit

@JeffBezanson JeffBezanson removed the needs news A NEWS entry is required for this change label Nov 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add version of split(str) that returns iterator
9 participants