RFC: `eachsplit` for iterative splitting #39245

anaveragehuman · 2021-01-14T03:24:58Z

Fixes #20603, replacing/closes #20688, closes #7027.

We might also want an eachrsplit to match rsplit.

Initial benchmarks show this being more expensive than the existing split. Advice on how to improve this would be appreciated.

Existing implementation:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     267.648 ns (0.00% GC)
  median time:      281.473 ns (0.00% GC)
  mean time:        306.094 ns (2.61% GC)
  maximum time:     7.298 μs (95.83% GC)
  --------------
  samples:          10000
  evals/sample:     315

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  876.213 ns (4 allocations: 800 bytes)
36

This PR:

@benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  384 bytes
  allocs estimate:  5
  --------------
  minimum time:     285.511 ns (0.00% GC)
  median time:      301.029 ns (0.00% GC)
  mean time:        333.977 ns (5.43% GC)
  maximum time:     10.054 μs (95.45% GC)
  --------------
  samples:          10000
  evals/sample:     276

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  1.116 μs (13 allocations: 1.34 KiB)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  813.111 ns (9 allocations: 576 bytes)
36

base/strings/util.jl

vtjnash

LGTM, with minor comments

anaveragehuman · 2021-04-03T21:43:33Z

Thanks for reviewing! I'm able to achieve 0 allocations with iteration using a loop, but there's still a noticeable overhead compared to the previous split implementation, some of which seems to be added by Iterators.filter when keepempty=false.

I've added another commit to use eachsplit instead of split where possible throughout Base so those functions can benefit from the speedup and smaller memory footprint.

Latest master:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     236.418 ns (0.00% GC)
  median time:      240.995 ns (0.00% GC)
  mean time:        257.252 ns (2.55% GC)
  maximum time:     4.492 μs (94.14% GC)
  --------------
  samples:          10000
  evals/sample:     421

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  878.023 ns (4 allocations: 800 bytes)
36

This PR:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     255.710 ns (0.00% GC)
  median time:      269.169 ns (0.00% GC)
  mean time:        283.707 ns (2.81% GC)
  maximum time:     6.288 μs (94.93% GC)
  --------------
  samples:          10000
  evals/sample:     355

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  1.052 μs (13 allocations: 1.34 KiB)
36

julia> @btime sum(length, split("The quick brown fox jumps over the lazy dog."; keepempty=true))
  967.529 ns (4 allocations: 800 bytes)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  752.438 ns (0 allocations: 0 bytes)
36

base/strings/util.jl

anaveragehuman · 2021-04-27T21:26:11Z

I couldn't figure out why Iterators.filter allocates so I moved the filtering into the function. Looks like performance is pretty close to the current implementation now.

Current master:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     268.547 ns (0.00% GC)
  median time:      275.657 ns (0.00% GC)
  mean time:        291.221 ns (2.48% GC)
  maximum time:     6.723 μs (95.40% GC)
  --------------
  samples:          10000
  evals/sample:     318

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  908.389 ns (4 allocations: 800 bytes)
36

This PR:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial:
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     243.090 ns (0.00% GC)
  median time:      247.575 ns (0.00% GC)
  mean time:        263.489 ns (2.89% GC)
  maximum time:     5.694 μs (94.25% GC)
  --------------
  samples:          10000
  evals/sample:     398

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  887.864 ns (4 allocations: 800 bytes)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  627.118 ns (0 allocations: 0 bytes)
36

cmcaine · 2021-09-06T13:41:37Z

The "needs news" label is inaccurate now.

This looks good to go? Maybe triage can make a decision on it and report here?

This commit moves the existing splitting implementation into an iterator named `eachsplit` and changes the definition of `split(...)` to `collect(eachsplit(...))`, plus a few edge cases.

anaveragehuman · 2021-09-06T16:26:27Z

Rebased again. I'm not completely happy about the new split being a little slower, but I suppose the takeaway is to use the iterating version if performance is important.

Current master:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial: 10000 samples with 405 evaluations.
 Range (min … max):  240.978 ns …   5.592 μs  ┊ GC (min … max): 0.00% … 93.55%
 Time  (median):     264.570 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   292.964 ns ± 199.246 ns  ┊ GC (mean ± σ):  3.05% ±  4.45%

   ▂▄██▇▆▃▃▂▂▂▃▂▂▁                                              ▂
  ▇████████████████▇▇▇▆▆▆▆▆▅▅▆▅▅▄▅▆▅▆▆▆▆▆▇▇▆▇▆▇▇▆▇▆▇▇▇▇▇▆▆▆▆▆▆▇ █
  241 ns        Histogram: log(frequency) by time        554 ns <

 Memory estimate: 272 bytes, allocs estimate: 2.

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  958.864 ns (3 allocations: 1.25 KiB)
36

This PR:

julia> @benchmark split("α β γ", " ")
BenchmarkTools.Trial: 10000 samples with 340 evaluations.
 Range (min … max):  258.824 ns …   8.959 μs  ┊ GC (min … max): 0.00% … 95.93%
 Time  (median):     289.379 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   316.502 ns ± 276.307 ns  ┊ GC (mean ± σ):  3.61% ±  4.03%

  ▁ ▄▃▇▂█▃▃▃▃▂▂▃▂▂▂                                             ▁
  █▆██████████████████▇▆▇▇▇▆▆▄▅▆▆▅▆▆▆▅▅▅▆▆▆▆▇▆▆▇▇▆▆▆▆▇▆▇▇▆▇▆▇▆▆ █
  259 ns        Histogram: log(frequency) by time        557 ns <

 Memory estimate: 272 bytes, allocs estimate: 2.

julia> @btime sum(length, split($"The quick brown fox jumps over the lazy dog."))
  981.739 ns (3 allocations: 1.25 KiB)
36

julia> @btime sum(length, eachsplit($"The quick brown fox jumps over the lazy dog."))
  721.389 ns (0 allocations: 0 bytes)
36

Moelf · 2021-09-06T21:57:28Z

also needs test?

cmcaine · 2021-09-08T09:03:26Z

This PR implements split with eachsplit and uses eachsplit in a few other places in Base, so it's kind of already covered by the existing tests. Not sure it needs any more?

sostock · 2021-09-09T06:28:52Z

Would it make sense to put this into Base.Iterators (as Iterators.split), since Base.Iterators defines lazy versions Base functions (cf. map/Iterators.map or filter/Iterators.filter)?

Also, it would be good to add a compat admonition to the docstring (“This function requires at least Julia 1.8”).

Somehow type constraints from the complex `while` condition don't propagate to the `while` body.

stevengj · 2021-09-13T02:52:02Z

Might be nice to have reverse-iteration support, i.e. to implement iterate(r::Iterators.Reverse{SplitIterator}, state...).

(That would also give you e.g. last(eachsplit(...), n).)

This moves the existing splitting implementation into an iterator named `eachsplit` and changes the definition of `split(...)` to `collect(eachsplit(...))`, plus a few edge cases.

jariji · 2022-09-16T17:52:10Z

Inspired by a question on Zulip, I'll ask, why is this function called eachsplit rather than Iterators.split? What is "each split" supposed to mean?

Moelf · 2022-09-16T18:48:39Z

because there are a few each*s

julia> each

eachcol    eachindex  eachline   eachmatch  eachrow    eachslice  eachsplit

vtjnash reviewed Apr 3, 2021

View reviewed changes

base/strings/util.jl Outdated Show resolved Hide resolved

vtjnash reviewed Apr 3, 2021

View reviewed changes

base/strings/util.jl Outdated Show resolved Hide resolved

vtjnash approved these changes Apr 3, 2021

View reviewed changes

anaveragehuman force-pushed the eachsplit branch from 9132a21 to bacefa6 Compare April 3, 2021 21:43

anaveragehuman force-pushed the eachsplit branch from bacefa6 to 94e3e09 Compare April 3, 2021 23:07

vtjnash reviewed Apr 19, 2021

View reviewed changes

base/strings/util.jl Outdated Show resolved Hide resolved

vtjnash mentioned this pull request Apr 19, 2021

WIP: Add an iterator implementation for String splitting #20688

Closed

anaveragehuman force-pushed the eachsplit branch from 4b2d6c9 to 3a4d4b1 Compare April 27, 2021 20:21

fredrikekre added strings "Strings!" needs news A NEWS entry is required for this change labels Apr 28, 2021

anaveragehuman force-pushed the eachsplit branch from 3a4d4b1 to c9fe96d Compare May 30, 2021 16:33

anaveragehuman added 2 commits September 6, 2021 12:23

strings: add eachsplit for iterative splitting

9bdd43e

This commit moves the existing splitting implementation into an iterator named `eachsplit` and changes the definition of `split(...)` to `collect(eachsplit(...))`, plus a few edge cases.

base: use eachsplit where possible

3f1801d

anaveragehuman force-pushed the eachsplit branch from c9fe96d to 3f1801d Compare September 6, 2021 16:23

vtjnash merged commit dba8a08 into JuliaLang:master Sep 8, 2021

anaveragehuman deleted the eachsplit branch September 9, 2021 02:50

aviatesk added a commit that referenced this pull request Sep 10, 2021

follow #39245, improve inferrability of iterate(::SplitIterator)

dcc2182

Somehow type constraints from the complex `while` condition don't propagate to the `while` body.

anaveragehuman mentioned this pull request Sep 26, 2021

strings: add compat annotation for eachsplit(). #42391

Merged

longemen3000 mentioned this pull request Oct 29, 2021

add eachsplit JuliaLang/Compat.jl#761

Merged

stevengj mentioned this pull request Jan 10, 2022

reverse iteration for eachsplit #43740

Open

JeffBezanson removed the needs news A NEWS entry is required for this change label Nov 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: `eachsplit` for iterative splitting #39245

RFC: `eachsplit` for iterative splitting #39245

anaveragehuman commented Jan 14, 2021 •

edited

Loading

vtjnash left a comment

anaveragehuman commented Apr 3, 2021

anaveragehuman commented Apr 27, 2021

cmcaine commented Sep 6, 2021

anaveragehuman commented Sep 6, 2021

Moelf commented Sep 6, 2021

cmcaine commented Sep 8, 2021

sostock commented Sep 9, 2021

stevengj commented Sep 13, 2021 •

edited

Loading

jariji commented Sep 16, 2022

Moelf commented Sep 16, 2022 •

edited

Loading

RFC: eachsplit for iterative splitting #39245

RFC: eachsplit for iterative splitting #39245

Conversation

anaveragehuman commented Jan 14, 2021 • edited Loading

vtjnash left a comment

Choose a reason for hiding this comment

anaveragehuman commented Apr 3, 2021

anaveragehuman commented Apr 27, 2021

cmcaine commented Sep 6, 2021

anaveragehuman commented Sep 6, 2021

Moelf commented Sep 6, 2021

cmcaine commented Sep 8, 2021

sostock commented Sep 9, 2021

stevengj commented Sep 13, 2021 • edited Loading

jariji commented Sep 16, 2022

Moelf commented Sep 16, 2022 • edited Loading

RFC: `eachsplit` for iterative splitting #39245

RFC: `eachsplit` for iterative splitting #39245

anaveragehuman commented Jan 14, 2021 •

edited

Loading

stevengj commented Sep 13, 2021 •

edited

Loading

Moelf commented Sep 16, 2022 •

edited

Loading