Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readline: use icu based string width calculation #9040

Closed
wants to merge 1 commit into from

Conversation

jasnell
Copy link
Member

@jasnell jasnell commented Oct 11, 2016

Checklist
  • make -j8 test (UNIX), or vcbuild test nosign (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

readline, internal

Description of change

Rather than the pseudo-wcwidth impl used currently, use the ICU character properties database to calculate string width and determine if a character is full width or not. This allows the algorithm to correctly identify emoji's as full width, ensures the algorithm will continue to fucntion properly as new unicode codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module, but has been split out.

Refs: #8075

@jasnell jasnell added readline Issues and PRs related to the built-in readline module. i18n-api Issues and PRs related to the i18n implementation. labels Oct 11, 2016
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. readline Issues and PRs related to the built-in readline module. labels Oct 11, 2016
@jasnell jasnell added the semver-minor PRs that contain new features and should be released in the next minor version. label Oct 11, 2016
@jasnell
Copy link
Member Author

jasnell commented Oct 11, 2016

Technically semver-minor because it allows now a single codepoint to be passed in as the argument to getStringWidth().

@jasnell
Copy link
Member Author

jasnell commented Oct 12, 2016

@bnoordhuis
Copy link
Member

This change would make intl and non-intl builds behave differently, wouldn't it? I think I'd rather have a single semi-broken algorithm than two diverging implementations. As well, the non-intl path will be untested until nodejs/build#419 is resolved.

@jasnell
Copy link
Member Author

jasnell commented Oct 12, 2016

Intl and non-intl builds already behave differently in many ways. This
would be no different. Even v8 already acts differently if ICU isn't
present.

On Wednesday, October 12, 2016, Ben Noordhuis [email protected]
wrote:

This change would make intl and non-intl builds behave differently,
wouldn't it? I think I'd rather have a single semi-broken algorithm than
two diverging implementations. As well, the non-intl path will be untested
until nodejs/build#419 nodejs/build#419 is
resolved.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#9040 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAa2eTOFILRRLZmfbAkdyGUaprCAOECaks5qzJ_XgaJpZM4KUK7U
.

@bnoordhuis
Copy link
Member

Intl and non-intl builds already behave differently in many ways.

Oh? In what way?

Even v8 already acts differently if ICU isn't present.

"Everyone is doing it" is never a valid argument. V8 is not under our direct control but this is.

@jasnell
Copy link
Member Author

jasnell commented Oct 12, 2016

Oh? In what way?

Off the top of my head:

  1. Presence of Intl
  2. Results of String.prototype.normalize (try '\u1E9B\u0323'.normalize('NFD') === '\u1E9B\u0323'.normalize('NFC') in a default build vs. --without-intl build)
  3. Behavior of locale specific methods like toLocaleUpperCase(), localeCompare(), and toLocaleString()
  4. ICU's 10x faster punycode implementation used if ICU is present (recent change)

There's likely more.

It's also not just about a more accurate implementation, performance is also significantly improved with the new algorithm.

this PR:

$ ./node --expose-internals benchmark/misc/stringwidth.js
misc/stringwidth.js test="a" millions=5: 6.288441921473474
misc/stringwidth.js test="丁" millions=5: 6.230841486482881
misc/stringwidth.js test="👸🏿" millions=5: 4.1727458183443265
misc/stringwidth.js test="👅" millions=5: 5.148377225269059
misc/stringwidth.js test="\n" millions=5: 6.285715980343315
misc/stringwidth.js test="‎f‏" millions=5: 5.2453819193365865
misc/stringwidth.js test="‎\n∊⃒" millions=5: 4.174633509425163

vs. latest v6:

$ node --expose-internals benchmark/misc/stringwidth.js
misc/stringwidth.js test="a" millions=5: 1.3992433734181065
misc/stringwidth.js test="丁" millions=5: 1.7113687595793867
misc/stringwidth.js test="👸🏿" millions=5: 1.5577253551652364
misc/stringwidth.js test="👅" millions=5: 1.6861030856604042
misc/stringwidth.js test="\n" millions=5: 1.7522551193962668
misc/stringwidth.js test="‎f‏" millions=5: 1.5648816109285903
misc/stringwidth.js test="‎\n∊⃒" millions=5: 1.459896127806548

(the benchmark test benchmark/misc/stringwidth.js is not part of this PR currently, it's local on my machine)

/**
* Tries to remove all VT control characters. Use to estimate displayed
* string width. May be buggy due to not running a real state machine
*/
function stripVTControlCharacters(str) {
str = str.replace(new RegExp(functionKeyCodeReAnywhere.source, 'g'), '');
return str.replace(new RegExp(metaKeyCodeReAnywhere.source, 'g'), '');
return str.replace(ansi, '');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need /g?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it uses /g.



module.exports = {
emitKeys,
getStringWidth,
isFullWidthCodePoint,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should initially assign no-ops with comments?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come? that does not seem very practical.

stripVTControlCharacters
};

if (process.binding('config').hasIntl) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not detectable from process.config?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using process.config is unreliable because there are userland packages that override it. That's why process.binding('config') was introduced.

return icu.getStringWidth(stripVTControlCharacters(str));
};
module.exports.isFullWidthCodePoint = function isFullWidthCodePoint(code) {
if (typeof code !== 'number')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this remain isNaN() ... or rather, Number.isNaN()?

// if the key.sequence is half of a surrogate pair,
// refresh the line so the character is displayed appropriately.
const ch = key.sequence.codePointAt(0);
if (ch >= 0xd800 && ch <= 0xdfff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment about these values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the existing comment on line 128 and 129 not enough?

@rvagg rvagg force-pushed the master branch 2 times, most recently from c133999 to 83c7a88 Compare October 18, 2016 17:02
@jasnell
Copy link
Member Author

jasnell commented Oct 21, 2016

@jasnell
Copy link
Member Author

jasnell commented Oct 22, 2016

Only unrelated flaky failures in CI. @nodejs/collaborators would appreciate additional review.
@trevnorris @addaleax ... PTAL, does this LGTY?

Copy link
Member

@bnoordhuis bnoordhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I still don't think this should land until we have a proper non-intl buildbot. So far, almost every intl-related change broke the non-intl build in one way or another.

// Adopted from https://github.com/chalk/ansi-regex/blob/master/index.js
// License: MIT, authors: @sindresorhus, Qix-, and arjunmehta
const ansi =
/[\u001b\u009b][[()#;?]*(?:[0-9]{1,4}(?:;[0-9]{0,4})*)?[0-9A-ORZcf-nqry=><]/g;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex could use a comment explaining what it tries to match/capture. Also, four space indent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

four space indent would make the line longer than 80 chars, and splitting the regex into multiple lines would make it less readable. I'd prefer to leave the spacing as is.

// newer wide characters. wcwidth, on the other hand, uses a fixed
// algorithm that does not take things like emoji into proper
// consideration.
static int GetColumnWidth(UChar32 codepoint, bool ambiguousFull = false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: s/ambiguousFull/ambiguous_full/ here and elsewhere. I don't like the name very much, it doesn't really convey what it does.

// algorithm that does not take things like emoji into proper
// consideration.
static int GetColumnWidth(UChar32 codepoint, bool ambiguousFull = false) {
const int eaw = u_getIntPropertyValue(codepoint, UCHAR_EAST_ASIAN_WIDTH);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why UCHAR_EAST_ASIAN_WIDTH? This needs an explaining comment.

return 2;
case U_EA_AMBIGUOUS:
if (ambiguousFull)
return 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the fall-through is intentional, can you add // fallthrough comments?

if (u_iscntrl(codepoint) ||
u_hasBinaryProperty(codepoint, UCHAR_EMOJI_MODIFIER) ||
u_getCombiningClass(codepoint) > 0)
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put braces around the body, slightly easier to read.

}

bool ambiguousFull = args[1]->BooleanValue();
bool expandEmojiSequence = args[2]->BooleanValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: snake_case

bool expandEmojiSequence = args[2]->BooleanValue();

TwoByteValue value(env->isolate(), args[0].As<String>());
UChar* str = reinterpret_cast<UChar*>(*value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reinterpret_cast shouldn't be necessary, should it?

n > 0 && p == 0x200d && // 0x200d == emoji sequence continuation
(u_hasBinaryProperty(c, UCHAR_EMOJI_PRESENTATION) ||
u_hasBinaryProperty(c, UCHAR_EMOJI_MODIFIER)))
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Braces?

@jasnell
Copy link
Member Author

jasnell commented Oct 23, 2016

Updated to address the nits.

@jasnell
Copy link
Member Author

jasnell commented Oct 23, 2016

@bnoordhuis ... added a nointl CI job we can run manually: https://ci.nodejs.org/job/node-test-commit-linux-nointl/1/
/cc @jbergstroem ... I'll tweak this a bit later to get it added to the main CI group.

@jbergstroem
Copy link
Member

@jasnell: I don't think it should be added to the current job as-is. There's way too many workers involved. Do we need to test this on every linux flavor? I was hoping to have a small subset of workers for build permutation tests.

@jasnell jasnell dismissed bnoordhuis’s stale review October 24, 2016 14:18

Updated. Needs another review from @bnoordhuis

Copy link
Member

@bnoordhuis bnoordhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a comment and a style nit.

if (args[0]->IsNumber()) {
args.GetReturnValue().Set(
GetColumnWidth(args[0]->Uint32Value(),
ambiguous_as_full_width));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you line up the argument?

return;
}

TwoByteValue value(env->isolate(), args[0].As<String>());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TwoByteValue's constructor takes a Local<Value>. It's arguably wrong to cast because there is no check that it's really a string so I'd just let the constructor deal with this.

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, with a couple of minor comments

// The expand_emoji_sequence option allows the caller to skip this
// check and count each code within an emoji sequence separately.
if (!expand_emoji_sequence &&
n > 0 && p == 0x200d && // 0x200d == emoji sequence continuation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U+200D is a ZWJ (zero width joiner), please use that term

// in advance if a particular sequence is going to be supported.
// The expand_emoji_sequence option allows the caller to skip this
// check and count each code within an emoji sequence separately.
if (!expand_emoji_sequence &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasoable way of doing this calculation

return 2;
case U_EA_AMBIGUOUS:
if (ambiguous_as_full_width) {
return 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth referencing something like http://www.unicode.org/reports/tr11/#Ambiguous here

@jasnell
Copy link
Member Author

jasnell commented Oct 24, 2016

Updated with the final nits addressed. New CI: https://ci.nodejs.org/job/node-test-pull-request/4655/

@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

had a compile nit pop up on windows... trying again https://ci.nodejs.org/job/node-test-pull-request/4664/

Rather than the pseudo-wcwidth impl used currently, use the ICU
character properties database to calculate string width and
determine if a character is full width or not. This allows the
algorithm to correctly identify emoji's as full width, ensures
the algorithm will continue to fucntion properly as new unicode
codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module,
but has been split out.

Refs: nodejs#8075
@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

CI is green except for unrelated flaky failures. Landing

jasnell added a commit that referenced this pull request Oct 25, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU
character properties database to calculate string width and
determine if a character is full width or not. This allows the
algorithm to correctly identify emoji's as full width, ensures
the algorithm will continue to fucntion properly as new unicode
codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module,
but has been split out.

Refs: #8075
PR-URL: #9040
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Steven R Loomis <[email protected]>
@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

Landed in 72547fe

@jasnell jasnell closed this Oct 25, 2016
@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

Shrinking it down ought to be fine. I'll do that this next week

On Sunday, October 23, 2016, Johan Bergström [email protected]
wrote:

@jasnell https://github.com/jasnell: I don't think it should be added
to the current job as-is. There's way too many workers involved. Do we need
to test this on every linux flavor? I was hoping to have a small subset of
workers for build permutation tests.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#9040 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAa2eSj8DXEb45zGppZSrbdxdlcyksZkks5q2650gaJpZM4KUK7U
.

@bnoordhuis
Copy link
Member

@jasnell g++ spotted a bug:

../src/node_i18n.cc: In function 'void node::i18n::GetStringWidth(const v8::FunctionCallbackInfo<v8::Value>&)':
../src/node_i18n.cc:543:15: warning: 'c' may be used uninitialized in this function [-Wmaybe-uninitialized]
     if (!expand_emoji_sequence &&
         ~~~~~~~~~~~~~~~~~~~~~~~~~
         n > 0 && p == 0x200d &&  // 0x200d == ZWJ (zero width joiner)

I think you need a U16_NEXT call outside the loop.

@jasnell
Copy link
Member Author

jasnell commented Oct 25, 2016

Hmm.. Ok, away from the laptop at the moment but will look at that shortly

@srl295
Copy link
Member

srl295 commented Oct 25, 2016

@bnoordhuis good catch - otherwise p=<undefined>

A U16_NEXT(str, n, value.length(), c); before the loop will initialize c.
May also obviate the n>0 check within the loop.

@bnoordhuis
Copy link
Member

#9280

evanlucas pushed a commit that referenced this pull request Nov 3, 2016
Rather than the pseudo-wcwidth impl used currently, use the ICU
character properties database to calculate string width and
determine if a character is full width or not. This allows the
algorithm to correctly identify emoji's as full width, ensures
the algorithm will continue to fucntion properly as new unicode
codepoints are added, and it's faster.

This was originally part of a proposal to add a new unicode module,
but has been split out.

Refs: #8075
PR-URL: #9040
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Steven R Loomis <[email protected]>
@MylesBorins
Copy link
Contributor

/cc @srl295 currently passing on this being backported to v6.x. Do you think it should be considered?

@jasnell
Copy link
Member Author

jasnell commented May 15, 2017

I'm not @srl295, of course, but I'll chime in to say that I see no pressing reason to backport this

@srl295
Copy link
Member

srl295 commented Aug 31, 2017

no pressing reason

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. readline Issues and PRs related to the built-in readline module. semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants