Update character width tables according to Unicode 9 #294

Closed
Keno wants to merge 2 commits

Keno commented Jun 24, 2016

Generated from the updated EastAsianWidth.txt using the following
script:

```julia
fullwidth = IOBuffer()
fullwidthsingle = IOBuffer()
ambiguouswidth = IOBuffer()
ambiguouswidthsingle = IOBuffer()

# `hex` was removed in Julia 1.0; this shim keeps the script working there
hex(x) = string(x, base=16)

function print_to_correct_buffer(rangebuf, singlebuf, range, str)
    if length(range) == 1
        println(singlebuf, "[$str addCharactersInRange:NSMakeRange(0x$(hex(first(range))), 1)];")
    else
        f, l = hex(first(range)), hex(last(range))
        println(rangebuf, "[$str addCharactersInRange:NSMakeRange(0x$f, 0x$l - 0x$f + 1)];")
    end
end

ranges = Any[]
for line in eachline("EastAsianWidth.txt")
    # Skip blank lines and full-line comments
    (isempty(line) || line[1] == '#') && continue
    # Strip trailing comments
    precomment = split(line, '#')[1]
    # Parse code point range and width code
    tokens = split(precomment, ';')
    length(tokens) >= 2 || continue
    charrange = tokens[1]
    width = strip(tokens[2])
    # Parse code point range into Julia UnitRange
    rangetokens = split(charrange, "..")
    charstart = parse(UInt32, "0x"*rangetokens[1])
    charend = parse(UInt32, "0x"*rangetokens[length(rangetokens)>1 ? 2 : 1])
    range = charstart:charend

    # Coalesce ranges
    if !isempty(ranges) && ranges[end][1] == width && last(ranges[end][2]) == first(range)-1
        ranges[end] = (width, first(ranges[end][2]):last(range))
    else
        push!(ranges, (width, range))
    end
end

for (width, range) in ranges
    if width=="W" || width=="F" # wide or full
        print_to_correct_buffer(fullwidth, fullwidthsingle, range, "sFullWidth")
    elseif width == "A"
        print_to_correct_buffer(ambiguouswidth, ambiguouswidthsingle, range, "sAmbiguousWidth")
    end
end
```
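
For readers who do not have Julia at hand, the parsing and range-coalescing steps of the script above can be sketched in Python. This is only an illustration, not the script used for the PR: the function name `parse_ranges` and the sample lines are assumptions made for the example (the sample follows the `start..end;width # comment` layout of EastAsianWidth.txt but is not copied from the real file).

```python
# Illustrative sample in the EastAsianWidth.txt layout (not the real file).
SAMPLE = """\
# full-line comment
00A1;A           # single code point, ambiguous
3000;F           # single code point, fullwidth
3001..3003;W     # a range, wide
3004;W           # adjacent to the previous range, same width
"""

def parse_ranges(text):
    """Return a list of (width, start, end) tuples with adjacent ranges merged."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0]          # drop trailing comments
        fields = data.split(";")
        if len(fields) < 2:                   # blank or comment-only line
            continue
        width = fields[1].strip()
        bounds = fields[0].strip().split("..")
        start = int(bounds[0], 16)
        end = int(bounds[-1], 16)             # same as start for single code points
        # Coalesce with the previous entry when the width matches
        # and the two ranges are directly adjacent.
        if ranges and ranges[-1][0] == width and ranges[-1][2] == start - 1:
            ranges[-1] = (width, ranges[-1][1], end)
        else:
            ranges.append((width, start, end))
    return ranges

for width, start, end in parse_ranges(SAMPLE):
    print(width, hex(start), hex(end))
```

Note that `3000;F` is not merged into the following `W` range even though the code points are adjacent, because the width codes differ; the Julia script makes the same distinction when coalescing.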
@Keno
Author

Keno commented Jun 25, 2016

@gnachman I've updated the tests as much as I knew how to (for some reason a much larger number of tests fail for me locally, so I can't really reproduce this failure). Not sure what the remaining complication in the emoji test is.

@gnachman
Owner

This is awesome! I'm glad to see some of the fixes for Emoji.

The Golden tests are a pain to work with because every machine renders text slightly differently.

The big question is when this should be enabled for the world. Should we support multiple versions of the width table?

@Keno
Author

Keno commented Jun 29, 2016

Given that working with emoji is pretty broken without this, I don't think there should be too much of a problem with just activating it immediately, but it's your call of course.

@gnachman
Owner

Non-interactive use (e.g., cat emoji.txt) is much better with Unicode 9, but both emacs and bash get totally confused (the cursor position does not correspond to where edits will actually occur).

Here's a screen recording demonstrating the craziness: https://iterm2.com/misc/Unicode9Bugs.mov

With the Unicode 8 tables the emoji overlap each other and it's ugly but at least it's possible to edit.

Different programs will get updated at different times, meaning everything's going to be broken for a while as far as emoji width goes.

I think it should be an off-by-default option for now and I'll flip it on by default when there's more adoption.

I'll make a note to merge this but put it behind an advanced pref.

@Keno
Author

Keno commented Jun 29, 2016

Could there be a proprietary escape code for a program to declare that it is unicode 9 aware?

@gnachman
Owner

Yes. It would be awesome if other terminals could standardize on something. Let me reach out to Thomas Dickey and see what he thinks.

@gnachman
Owner

Looks like xterm doesn't support emoji, so scratch that idea. We should just invent something and maybe others will follow.

@asmeurer
Contributor

Do the bugs you mention also exist for East Asian characters, like コンニチハ? For me, in iTerm2, both in bash and in emacs, it seems to work just fine, except for an issue where if I select a character the right half is not inverse-videoed correctly.

I would argue to make the change immediately. It could be tough for some terminal applications that use emoji, but being double width is the correct behavior, as sanctioned by the Unicode standards.

I don't see the point of a proprietary escape code. If an application is aware of iTerm2, couldn't it just use whatever mechanism gets the iTerm2 version (I forgot how that works) and decide how to render emoji based on that? Or better yet, just print emoji the "correct" way (no extra space after each emoji to keep them from overlapping), and require users to use an up-to-date terminal emulator to get the best formatting.

@asmeurer
Contributor

I am unable to insert or paste emoji into bash. It just strips them from the text. How were you able to do that in your video?

@asmeurer
Contributor

OK, I built this branch and I see the issue now (for some reason, I can't insert emoji in iTerm2 on my other machine; maybe I had a broken nightly, or maybe it's because this machine has Sierra?).

The issue is the mismatch between the wcwidth (or equivalent) of the terminal application, and what iTerm2 thinks is happening.

Sadly, you get equally broken behavior if wcwidth thinks that emoji should be double width, but iTerm2 doesn't. Here is a video using xonsh (which uses the wcwidth Python library, which as of 0.1.7 uses Unicode 9.0) and iTerm2 3.0.20160720-nightly.

So things are broken either way, unless both iTerm2 and the underlying application agree on how wide an emoji is.
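
A minimal sketch of that disagreement, in Python: the application and the terminal each add up character widths to work out the cursor column, and the two sums differ as soon as an emoji appears. The function names are made up for the example, and the "older table" is an assumption that simply forces the Emoticons block (U+1F600 to U+1F64F) to be narrow; real pre-Unicode 9 tables differ in more places. Combining characters are ignored for brevity.

```python
import unicodedata

def width_unicode9(ch):
    """Width from Python's bundled East Asian Width data
    (Unicode 9.0 or later on Python 3.6+)."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def width_pre_unicode9(ch):
    """Same as above, but with the Emoticons block forced narrow.
    This is a simplifying assumption standing in for an older width table."""
    if 0x1F600 <= ord(ch) <= 0x1F64F:
        return 1
    return width_unicode9(ch)

def cursor_column(text, width):
    """Column the cursor ends up in after printing `text` from column 0."""
    return sum(width(ch) for ch in text)

line = "a\U0001F600b"  # 'a', GRINNING FACE, 'b'

app_col = cursor_column(line, width_unicode9)       # what the application assumes
term_col = cursor_column(line, width_pre_unicode9)  # what the terminal actually does

print("application thinks the cursor is at column", app_col)
print("terminal has the cursor at column", term_col)
print("drift:", app_col - term_col)
```

Each emoji on the line adds one column of drift, and it does not matter which side has the newer table: swapping the two width functions produces the same mismatch with the opposite sign.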

Hence, my argument is: allow the behavior to be changed (maybe with options to change it only for certain applications), but by default, use the Unicode 9.0 behavior. That way, at least iTerm2 is pushing the other applications in the right direction (most terminal applications see the emulator as a source of truth when it comes to ambiguity anyway).

@gnachman
Owner

gnachman commented Aug 4, 2016

I like pushing things forward but I also hate having a giant queue of bugs filed by confused users :)

What I'd like to do is to define a new escape sequence to set the unicode version. At least this way it'll be possible to support both and people who know what they're doing can configure their apps to switch versions.

@gnachman
Owner

gnachman commented Aug 8, 2016

Merged (mostly). Please see this page for info: https://gitlab.com/gnachman/iterm2/wikis/unicodeversionswitching
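
For illustration, here is how an application might emit such a version-switching sequence from Python. The exact form used below (`OSC 1337 ; UnicodeVersion=N BEL`) is my recollection of iTerm2's proprietary escape codes and should be treated as an assumption; the linked wiki page is the authoritative description. The helper name `unicode_version_sequence` is hypothetical.

```python
import sys

ESC = "\x1b"  # escape
BEL = "\x07"  # bell, used here as the OSC terminator

def unicode_version_sequence(version):
    """Build the escape sequence asking the terminal to use the width
    tables of the given Unicode version (assumed syntax, see the wiki page)."""
    return f"{ESC}]1337;UnicodeVersion={version}{BEL}"

if __name__ == "__main__":
    # An application that computes widths with Unicode 9 tables would
    # send this once at startup, before drawing anything.
    sys.stdout.write(unicode_version_sequence(9))
    sys.stdout.flush()
```

Terminals that do not recognize the sequence should ignore it, so an application that sends it still has to cope with a terminal that keeps using older tables.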

Thanks for getting the ball rolling on this, @Keno, and for being on the right side of history, @asmeurer
