-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suggestion: Preserve Unicode Escape Sequences and make current behavior opt-in to avoid malformed output in some environments #485
Comments
This is good to know. I think this increases the priority of having an ASCII-only output mode (#70).
I don't think this sounds like the right solution. Either your environment supports UTF-8 or it doesn't. If it doesn't, you don't want unescaped UTF-8 in your output even if it was originally written that way in the source code. And if it does, I don't see why esbuild shouldn't take advantage of it. So I think esbuild should either always use UTF-8 in string literals, or always generate escape sequences for all non-ASCII characters in string literals.
I don't think the default behavior should be to assume the target environment doesn't support UTF-8. Partially because IE is well on its way out, and because other commonly-used minifiers such as Terser and UglifyJS also have the same default behavior as esbuild here (assuming that UTF-8 is supported). Edit: and after a quick check it looks like Webpack and Parcel also do this by default, so most of the ecosystem behaves this way. So I think the way to solve this is an opt-in ASCII-only output mode (issue #70). I would have done this a long time ago except this unfortunately isn't as simple as it sounds because it involves parsing regular expressions to strip UTF-8 from them too, and esbuild doesn't currently parse regular expressions at all. However, it's pretty trivial to move forward with an ASCII-only output option with the caveat that regular expressions containing non-ASCII characters will continue to be unescaped in the output for now. So right now I'm thinking of doing that. |
Something else to consider as well is that many browsers have an optimized scanning path for ASCII code (even when marked as UTF-8). Converting to UTF-8 may provide a decrease in file size but could end up causing increased execution time and cancel out any improvements in transfer time. Chromium code reference: https://source.chromium.org/chromium/v8/v8.git/+/master:src/parsing/scanner-character-streams.cc;l=669?originalUrl=https:%2F%2Fcs.chromium.org%2F |
This is interesting. My first instinct was to discount this because surely download times massively dwarf parse times in the vast majority of cases, especially because scanning UTF-8 is very cache-friendly. But the chunk size is 512 which is kind of big. I also found https://v8.dev/blog/scanner which seems to indicate that parsing an ASCII string is 1.7x faster than parsing a UTF-8 string. There is also this quote at the end of the article:
It turns out that esbuild also does this kind of optimization itself when mapping from byte offsets to column numbers for source maps. It was a noticeable speedup to avoid doing this for ASCII-only lines (chunking happens at the line level): esbuild/internal/js_printer/js_printer.go Lines 677 to 689 in da972af
I still think this isn't that strong a reason though, since these cases are relatively rare. A given chunk of JavaScript code will be ASCII the vast majority of the time because minifiers use ASCII identifiers for variable names. To me a stronger reason is the convenience of not having to deal with UTF-8 encoding issues when using esbuild. Some of my goals for esbuild are a) to follow ecosystem conventions and b) to be easy to use without too much configuration. Those two goals are in conflict in this situation and I'm still trying to figure out what to do about it. It's why I haven't released this new flag yet. |
Please see this repro.
In it, you'll see an input file such as this:
Running
esbuild
with no configuration on it results in the fileoutput.js
which produces a different declaration ofunicodeWhitespaceCharacters
as can be seen in the repro.Intuitively, I would expect a 1:1 mapping between the input and output in this particular case.
I've been trying to make sense of it, and I understand that this is related to an optimization you've written that rewrites unicode escape sequences. In this issue: #70, you write the following:
However, this may lead to
Unterminated string constant
exceptions in browser such as IE 11 - even with<meta charset="utf-8">
in the HTML document. On that note, it is not always possible for the website author to instruct IE to use UTF-8. There's a bunch of ways IE can choose to ignore that.As I can see there have been a few similar issues like this one (#315, #112, #70), I would like to propose that unicode escape sequences are preserved as they are presented in the source code, and then the current behavior could be opt-in, for example as an advanced minification option.
I ran into this problem specifically while bundling
core-js
(the declaration from the example is located withincore-js/internals/whitespaces.js
withesbuild
and testing it across browsers.I don't currently have any way of working around this other than performing a string replacement on the bundle generated by
esbuild
to bring back the unicode escape sequences and updating the sourcemap mappings, which is slow and tedious, so I really hope for a better way around this :-)The text was updated successfully, but these errors were encountered: