Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rtruncate, ltruncate, ctruncate for truncating strings #55351

Merged
merged 8 commits into from
Aug 8, 2024

Conversation

IanButterworth
Copy link
Member

Co-authored with @tecosaur

Looking at the implementation of these functions in Profile (which are touched in #55335) uncovered issues and it seems to make sense to make and test the proper functions and add them to Base.

@IanButterworth IanButterworth added the strings "Strings!" label Aug 2, 2024
@oscardssmith
Copy link
Member

oscardssmith commented Aug 2, 2024

Isn't this font and terminal emulator dependent? (e.g. 🏳️‍🌈 is 1 wide in a browser but in my terminal emulator, displaying as 🏳️🌈)

@nsajko nsajko added the feature Indicates new feature / enhancement requests label Aug 2, 2024
@IanButterworth
Copy link
Member Author

I'm far from an expert on this, but isn't that lack of adherence to the standard by the terminal, not variability that should be handled?

@nsajko
Copy link
Contributor

nsajko commented Aug 2, 2024

isn't that lack of adherence to the standard by the terminal

In any case, textwidth seems to follow the behavior in the terminal. So textwidth itself, and thus these proposed functions, are really specific to terminal, or at least fixed-width usage?

@oscardssmith
Copy link
Member

I'm far from an expert on this, but isn't that lack of adherence to the standard by the terminal, not variability that should be handled?

no. Quoting https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html:

The following are the recommended emoji zwj sequences, which use a U+200D ZERO WIDTH JOINER (ZWJ) to join the characters into a single glyph if available. When not available, the ZWJ characters are ignored and a fallback sequence of separate emoji is displayed. Thus an emoji zwj sequence should only be supported where the fallback sequence would also make sense to a viewer.

@IanButterworth
Copy link
Member Author

These are adjacent functions to rpad/lpad and it was decided to make those textwidth-based, so isn't the point moot? #38256

NEWS.md Outdated Show resolved Hide resolved
@stevengj
Copy link
Member

stevengj commented Aug 4, 2024

(It definitely depends on the terminal; see the classic issue #3721 — but alternatives like length are even worse.)

base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
@stevengj
Copy link
Member

stevengj commented Aug 4, 2024

Do we really need all three of these in the stdlib? ctrunc seems especially esoteric.

@tecosaur
Copy link
Contributor

tecosaur commented Aug 4, 2024

I see two arguments for this functionality finding its way into Base.

  • With respect to fitting a string to a certain width, it's the other half of lpad/rpad
  • A naive implementation of truncation is prone to bugs, even appearing in the stdlib (see: the bugs I picked up in Profile that Ian referenced)

With regards to why ctrunc should be included at all, I'd consider centered truncation to be similarly common to ltrunc / rtrunc. For example, it's what string's 3-arg show does: "abc..." ⋯ 326 bytes ⋯ "...xyz".

While cpad is easy to implement based on lpad and rpad:

cpad(str, w) = lpad(rpad(str, w ÷ 2), w)

and this was given as the reason why cpad was depreciated in 2017 (#23187).

Unlike cpad, ctrunc is not trivially formed from a combination of ltrunc and rtrunc, and shares all of the same risks with naive implementations. On this basis, I see it as a worthwhile inclusion.

@tecosaur
Copy link
Contributor

tecosaur commented Aug 4, 2024

Oh also, I've just discovered a generalised truncation function I implemented in About.jl 😆 (that would have been helpful while Ian and I repeated the work in our DMs on Slack)

function struncate(str::AbstractString, maxwidth::Int, joiner::AbstractString = "", mode::Symbol = :center)
    textwidth(str) <= maxwidth && return str
    left, right = firstindex(str), lastindex(str)
    width = textwidth(joiner)
    while true
        if mode  (:right, :center)
            (width += textwidth(str[left])) <= maxwidth || break
            left = nextind(str, left)
        end
        if mode  (:left, :center) && width < maxwidth
            (width += textwidth(str[right])) <= maxwidth || break
            right = prevind(str, right)
        end
    end
    str[begin:prevind(str, left)] * joiner * str[nextind(str, right):end]
end

@IanButterworth
Copy link
Member Author

A single function like that seems tidier, but is it too atypical to put key functionality on a kwarg? Otherwise we might have had pad(.., mode = :right).

@stevengj
Copy link
Member

stevengj commented Aug 5, 2024

A single function like that seems tidier, but is it too atypical to put key functionality on a kwarg?

We could always use a single-function implementation internally (maybe with a Val argument?) but still expose ltrunc etc as the API.

@IanButterworth
Copy link
Member Author

Indeed. Just updated to the single internal implementation

ctrunc(str::AbstractString, maxwidth::Int, replacement::Union{AbstractString,Char} = '…'; prefer_left::Bool = true) =
struncate(str, maxwidth, replacement, :center, prefer_left)

function struncate(str::AbstractString, maxwidth::Int, replacement::Union{AbstractString,Char}, mode::Symbol = :center, prefer_left::Bool = true)
Copy link
Member

@stevengj stevengj Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function struncate(str::AbstractString, maxwidth::Int, replacement::Union{AbstractString,Char}, mode::Symbol = :center, prefer_left::Bool = true)
function _struncate(str::AbstractString, maxwidth::Int, replacement::Union{AbstractString,Char}, ::Val{mode}, prefer_left::Bool = true) where {mode}

use an underscore to indicate that this is an internal method, and use Val so that the mode checks can be elided by the compiler?

(Especially since the mode checks occur in the inner loop.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have public we don't need to mark internals by name convention?

Val sounds good. I didn't appreciate why before.

Copy link
Member

@KristofferC KristofferC Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Val and not e.g. a Symbol keyword arg?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have public we don't need to mark internals by name convention?

because the REPL autocompletes and even gives hints for private names: #51327

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That change just needs implementing, right? That can happen in time for 1.12 along with this code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure that the mode == ... conditionals are evaluated at compile time.

This should be backed up by some benchmark in my opinion. A single pointer equality check seems to me insignificant from the work of the function.

In principle this could also happen with enum or Symbol arguments if we ensure that the function is inlined.

I don't think constant propagation requires inlining. And there is also @constprop :aggressive.

Val is never really a good idea to make an API upon (any longer).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be backed up by some benchmark in my opinion. A single pointer equality check seems to me insignificant from the work of the function.

I tried this with

s = randstring(10^5)
@btime struncate($s, 100, '', :center);

and I got about an 8% speedup by hard-coding the mode checks. (This is after doing the total_width optimization noted below — otherwise, it spends all its time on the textwidth(str) check.)

Copy link
Member

@stevengj stevengj Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get another 60% speedup by replacing textwidth(::Char) with an optimized version that has a fast path for ASCII characters. We might want to do this in a separate PR:

const textwidth_ascii = Cint[textwidth(Char(i)) for i = 0:127]
function textwidth2(c::AbstractChar)
    b = bswap(reinterpret(UInt32, c))
    b < 0x80 && @inbounds return Int(textwidth_ascii[b+1]) # ASCII fast path
    Base.ismalformed(c) && return 1
    Int(ccall(:utf8proc_charwidth, Cint, (UInt32,), c))
end

Update: this optimization is #55398

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Val I do see reduction from

julia> @b ctrunc("dsadsafsadsadsafdsad", 10)
150.062 ns (4 allocs: 112 bytes)

to

136.986 ns (4 allocs: 112 bytes)

That's without #55398

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #55398 this is now

julia> @b ctruncate("dsadsafsadsadsafdsad", 10)
91.665 ns (4 allocs: 112 bytes)

base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
@nsajko
Copy link
Contributor

nsajko commented Aug 6, 2024

Could the exported functions get more descriptive names? E.g., truncate_from_left. Or even truncate_string_from_left.

@IanButterworth
Copy link
Member Author

@nsajko these are supposed to be the counterparts to to rpad/lpad. I think the names make sense with that context?

@nsajko
Copy link
Contributor

nsajko commented Aug 6, 2024

IMO, given that using Base is the default for modules, the less names are exported from Base, cluttering non-bare modules, the better. Especially if the names are very short or not used especially often. Not sure how prevalent this opinion is, but there it is.

So, IMO, in order of preference:

  1. descriptive public names
  2. descriptive export-ed names or shorthand public names
  3. shorthand export-ed names.

FWIW.

@IanButterworth
Copy link
Member Author

Ok. Fair.

truncate_right? Allows extending methods for non string types.

base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
base/strings/util.jl Outdated Show resolved Hide resolved
base/exports.jl Outdated
@@ -632,6 +634,7 @@ export
rpad,
rsplit,
rstrip,
rtrunc,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsajko Regarding renaming. I'm wavering. We have rpad rsplit rstrip, lpad lstrip.
Moving to truncate_right etc. feels a bit awkward for something so closely tied.

What about rtruncate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, rtruncate is much better, I vote for rtruncate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do generally like unabbreviated names (nextind continues to annoy me).

However... I do wonder if there's a danger in ltruncate (& co.) being very similar in name to truncate yet doing something completely different?

base/strings/util.jl Outdated Show resolved Hide resolved
@IanButterworth IanButterworth changed the title add rtrunc, ltrunc, ctrunc for truncating strings add rtruncate, ltruncate, ctruncate for truncating strings Aug 7, 2024
@tecosaur
Copy link
Contributor

tecosaur commented Aug 7, 2024

Not to open a can of worms, but it occurs to me that in cases like:

julia> struncate("🍕🍕🍕🍕🍕🍕xxxxxxxxxxx", 9, "")
"🍕🍕🍕…xx"

The "centre" truncate call has 6 cells of 🍕 on the left, and 2 of x, which isn't very centred 😬.

(Sorry Ian 😅)

@IanButterworth
Copy link
Member Author

Just pushed a fix

julia> @time ctruncate("🍕🍕🍕🍕🍕🍕xxxxxxxxxxx", 9)
  0.000014 seconds (4 allocations: 112 bytes)
"🍕🍕…xxxx"

@IanButterworth IanButterworth merged commit e439836 into JuliaLang:master Aug 8, 2024
5 of 7 checks passed
@IanButterworth IanButterworth deleted the ib/string_trunc branch August 8, 2024 13:42
lazarusA pushed a commit to lazarusA/julia that referenced this pull request Aug 17, 2024
…erms of `textwidth` (JuliaLang#55351)

Co-authored-by: Timothy <[email protected]>
Co-authored-by: Steven G. Johnson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Indicates new feature / enhancement requests strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants