-
Notifications
You must be signed in to change notification settings - Fork 8.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C1 control characters detection breaks output (regression) #10310
Comments
I think #7854 (comment) explains this. |
@skyline75489 has the right of it. Those characters are unspecified in 1252. They are control characters after f91b53d, but they were control characters before that, too. What is a well-meaning application doing printing things outside of its codepage’s codepoint coverage? |
The application basically outputs a file picked by the user using a codepage picked by the user. If that behaviour is now by design - ok. |
A few more thoughts:
Don't ask "but why?" - users can, so they will. Sanitising file names, everywhere, even in scenarios not related to outputting anything for the sake of the feature that is "almost never used"... 🤔 |
Note that your test case won't be interpreted as control characters in conhost (even the very latest build), because you have to have the That said, the cmd shell does enable VT mode, so control characters in a filename could be an issue there. Somebody already raised that in issue #10069, which I misdiagnosed as a conpty problem, but I've just checked with a recent OpenConsole build and can reproduce the issue there too. So that issue should probably be reopened - it's not a dup of #4363. However, note that this has been an issue long before PR #7340, because we already supported the 8-bit CSI control before then. PR #7340 just added more controls. |
At least two more users are affected by this behavior. |
It seems like BurntSushi/ripgrep#1992 is another instance of "an application is printing UTF-8 to the screen without setting the console codepage to UTF-8, or converting internally and printing it as UTF-16." |
I'm aware of In any case, I want to run commands without escape codes causing cut off lines and random escape characters appearing on the next zsh prompt. This escape code behavior is not occurring in the default WSL terminal (via conhost, I believe), where I see instead Update: piping
I replaced This snippet is displayed in vim:
Moving the cursor around causes all sorts of trouble. Moving the cursor down, line-by-line, from the top line to the bottom without entering insert mode:
On further inspection, this same file shows up screwy in conhost, too, so vim's handling of UTF-8 could be imperfect. Also, |
Hmm. I let my assumptions cloud my interpretation of the linked issue and missed that this was under WSL. I'm sorry. conhost will unfortunately exhibit this behavior on Windows 11 (in before I've accidentally missed that detail too) or a later update, since the code in this repo is the code for conhost and huge swaths of the terminal emulation code are straight-up shared. I'll have to come back to the vim bits after the weekend for a deeper investigation. |
I've gotta link this file I'm using as a test for reference: https://github.com/vim/vim/blob/master/src/testdir/test_regexp_utf8.vim I'm not even as much worried about the proper character display as much as the Also, sorry for hijacking @alabuzhev 's thread |
That's a response to the |
@PennRobotics FYI, this particular case has got nothing to do with C1 controls. Lines 1, 2 and 5 have got combining characters or non-spacing marks, that we are probably not handling correctly, or at least not in the same way as vim. Lines 3 and 4 are all just invalid code points in UTF-8, which I guess could also be an issue if there is a discrepancy in the way vim and WT handle the erroneous values. I'm not seeing the content changing when moving the cursor up and down in vim, but maybe that depends on the version or other configuration differences. It certainly wouldn't surprise me if it did something weird with those characters though. In any event, I think these problems are possibly more on topic in issue #8000, but @DHowett can correct me on that. |
@j4james Sure enough, Unrelated/unimportant: There's still another stray character sequence that---when running |
@PennRobotics That's because full C1 support was only added in PR #7340 and assumedly hasn't made its way into the inbox conhost yet, or at least not the version you're using. But |
Is there an easy way to universally disable the aspect of C1 support that is putting characters into the input buffer? An environmental var? I understand this could be the side effects of an exciting new/upcoming feature, but I can't say I would want the old terminal to start doing what this newer Terminal does---prefixing my next command with some gibberish. I do have the use case where I need to grep a mixed-ASCII-and-binary file (.elf w/ symbols and strings, .pdf, etc.) as part of a larger codebase search, so this has become a rare but occasional annoyance. (I also understand that I'm in a minority of users, and the workaround of pressing Ctrl+U to clear the input line is much, much faster than any alternatives that I can conceive.) |
If you're outputting raw control characters to the screen, there is always the possibility you're going to trigger a query response, even if you could disable C1 support. You'd really need to disable all VT processing entirely. So if you're using grep on binary files, I'd suggest piping the output through |
Eww. In time, I'll find a good alternative to It doesn't make sense why a query response shows up at the prompt as the next command. In contrast, something like If you don't delete the query response's 1; from the next prompt, you'll jump to the last pushed directory and run a (likely) invalid command. It's not a major bug, but it is an annoyance. I fail to understand why the control character must become part of stdin instead of stdout, but even wikipedia indicates a terminal can generate sequences that seem to come from the user. However, for the sake of your time and my time, I have no more interest in this subject and reserve hope that stray characters never land in the input buffer at some point in the future. Thanks for sharing all this reasonably obscure info on terminal nuances. |
Piping to |
@DHowett If we want to try and do something about this, I have a proposal that I think might make most people happy.
Hopefully this will cut down on the bug reports, without actually losing any significant functionality. It's also worth mentioning that XTerm doesn't support C1 controls in UTF-8 either, so it's unlikely to cause compatibility issues with Linux apps. While there are a few Linux terminals that do support UTF-8 C1 (VTE being the most well known), I think they're probably in the minority. Anyway, I don't feel that strongly about this either way, but I'd be happy to put together a PR if you like the idea. |
@j4james I'm totally on board with this proposal. I love it. The only thing that gives me pause (and it is not enough pause for me to care) is that I think we supported one C1 control when we initially went open-source, and I somewhat wondered if there was a reason we chose to support that one. It's likely a "this one seems common, maybe we should support it we guess?" situation. |
I'll probably prepare this one for isolated ingestion into Windows so it can be released in a servicing update. 😄 Folks may eventually be more broadly upset that conhost is "acting weird" and "displaying corrupt text" and "stole my lunch." |
At the time you went open-source, there were only 5 supported sequences that could potentially have been implemented as C1 as far as I could see: That leaves What's giving me pause now, though, is I just went to dig up an issue I remembered from the VTE tracker, where they were considering removing their C1 handling, thinking it would support this decision, but they ultimately chose not to (this was issue 209). Ironically they cited the Windows support of C1 as one of the reasons for keeping it. It wasn't just the fact that they chose to keep it, though, but they linked to a bug report in Alacritty where Fedora was using an OSC sequence with a C1 control, and since Alacritty didn't support C1, it got a bunch of garbage output on the screen. So that is a situation we could end up facing as well. That said, I think this particular case only arose because Alacritty was started from a VTE shell, and was misrecognized as VTE because the VTE_VERSION environment variable was set. So I still think we are more likely to get bug reports from having C1 enabled by default, than we will if we remove it, but I don't want to pretend there are no downsides. |
Guys, any recommendations which characters should be used for C1 replacement? |
There are some code pages with "unmapped" code points in the C1 range, which results in them being translated into Unicode C1 control codes, even though that is not their intended use. To avoid having these characters triggering unintentional escape sequences, this PR now disables C1 controls by default. Switching to ISO-2022 encoding will re-enable them, though, since that is the most likely scenario in which they would be required. They can also be explicitly enabled, even in UTF-8 mode, with the `DECAC1` escape sequence. What I've done is add a new mode to the `StateMachine` class that controls whether C1 code points are interpreted as control characters or not. When disabled, these code points are simply dropped from the output, similar to the way a `NUL` is interpreted. This isn't exactly the way they were handled in the v1 console (which I think replaces them with the font _notdef_ glyph), but it matches the XTerm behavior, which seems more appropriate considering this is in VT mode. And it's worth noting that Windows Explorer seems to work the same way. As mentioned above, the mode can be enabled by designating the ISO-2022 coding system with a `DOCS` sequence, and it will be disabled again when UTF-8 is designated. You can also enable it explicitly with a `DECAC1` sequence (originally this was actually a DEC printer sequence, but it doesn't seem unreasonable to use it in a terminal). I've also extended the operations that save and restore "cursor state" (e.g. `DECSC` and `DECRC`) to include the state of the C1 parser mode, since it's closely tied to the code page and character sets which are also saved there. Similarly, when a `DECSTR` sequence resets the code page and character sets, I've now made it reset the C1 mode as well. I should note that the new `StateMachine` mode is controlled via a generic `SetParserMode` method (with a matching API in the `ConGetSet` interface) to allow for easier addition of other modes in the future. And I've reimplemented the existing ANSI/VT52 mode in terms of these generic methods instead of it having to have its own separate APIs. ## Validation Steps Performed Some of the unit tests for OSC sequences were using a C1 `0x9C` for the string terminator, which doesn't work by default anymore. Since that's not a good practice anyway, I thought it best to change those to a standard 7-bit terminator. However, in tests that were explicitly validating the C1 controls, I've just enabled the C1 parser mode at the start of the tests in order to get them working again. There were also some ANSI mode adapter tests that had to be updated to account for the fact that it has now been reimplemented in terms of the `SetParserMode` API. I've added a new state machine test to validate the changes in behavior when the C1 parser mode is enabled or disabled. And I've added an adapter test to verify that the `DesignateCodingSystems` and `AcceptC1Controls` methods toggle the C1 parser mode as expected. I've manually verified the test cases in #10069 and #10310 to confirm that they're no longer triggering control sequences by default. Although, as I explained above, the C1 code points are completely dropped from the output rather than displayed as _notdef_ glyphs. I think this is a reasonable compromise though. Closes #10069 Closes #10310
🎉This issue was addressed in #11690, which has now been successfully released as Handy links: |
Conhost/Terminal supported C1 control sequences for a while... but then that apparently caused some problems. So they disabled them by default, which causes all our VT SGR sequences to be broken (so instead of pretty color output, you just see gray output, with strange numbers sprinkled all over). Fortunately they provided a way to turn them back on. Related: microsoft/terminal#11690 Related: microsoft/terminal#10310 Note that I believe you need a relatively recent build for this change to have effect.
Windows Terminal version (or Windows build number)
1.9.1445.0
Other Software
No response
Steps to reproduce
Compile and run the following code:
Expected Behavior
€�‚ƒ„…†‡ˆ‰Š‹Œ�Ž��‘’“”•–—˜™š›œ�žŸ
or something similar, depending on your output codepageActual Behavior
€‚ƒ„…†‡ˆ‰Š‹Œ
- most of the characters are missing.After f91b53d the whole range 0x80 - 0x9F is considered control characters.
The comment above boldly claims that
, but that is simply not true: as the example code above demonstrates, \x81, \x8D, \x8F, \x90, and \x9D are not handled by MultiByteToWideChar (at least in codepage 1252 ANSI - Latin I, hopefully popular enough), so no, not everything is Unicode by the time we get here, and no, we do need to worry about such confusion and implement a proper check to avoid breaking existing applications.
The text was updated successfully, but these errors were encountered: