Should we forbid U+226E (≮) and U+226F (≯) in hosts? #733

Closed
annevk opened this issue Jan 12, 2023 · 15 comments

@annevk
Member

annevk commented Jan 12, 2023

From https://www.unicode.org/reports/tr46/#UseSTD3ASCIIRules:

There are a very small number of non-ASCII characters with the data file status disallowed_STD3_valid:

U+2260 ( ≠ ) NOT EQUAL TO
U+226E ( ≮ ) NOT LESS-THAN
U+226F ( ≯ ) NOT GREATER-THAN

Those characters are disallowed with UseSTD3ASCIIRules=true because the set of characters in their canonical decompositions are not entirely in the valid set (Step 7 of the Table Derivation). However, they are allowed with UseSTD3ASCIIRules=false, because the base characters of their canonical decompositions, U+003D ( = ) EQUALS SIGN, U+003C ( < ) LESS-THAN SIGN, and U+003E ( > ) GREATER-THAN SIGN, are each valid under that option. If an implementation uses UseSTD3ASCIIRules=false but disallows any of these three ASCII characters, then it must also disallow the corresponding precomposed character for its negation.

We allow =, but < and > are forbidden. All of the three non-ASCII code points listed above work fine in WebKit and I personally might not see the problem as strongly as UTS46 does. I added tests for them in web-platform-tests/wpt#37907. (The tests reflect the status quo.)

Thoughts?

cc @karwa @ricea @achristensen07 @valenting
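As a rough illustration of the UTS46 passage quoted above, the canonical decompositions of the three code points can be inspected with Python's standard `unicodedata` module. This is only a sketch: `FORBIDDEN_ASCII` and `canonical_base` are hypothetical names, and the set holds just the two ASCII characters this comment describes as forbidden in hosts.

```python
import unicodedata

# Assumption for illustration: the two ASCII characters that this
# thread says are forbidden in hosts ("=" is allowed).
FORBIDDEN_ASCII = {"<", ">"}


def canonical_base(ch: str) -> str:
    """Return the first code point of the canonical (NFD) decomposition."""
    return unicodedata.normalize("NFD", ch)[0]


for ch in "\u2260\u226E\u226F":
    nfd = unicodedata.normalize("NFD", ch)
    base = canonical_base(ch)
    points = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: NFD = {points}, "
        f"base {base!r} forbidden: {base in FORBIDDEN_ASCII}"
    )
```

Each of the three decomposes to its ASCII base character followed by U+0338 COMBINING LONG SOLIDUS OVERLAY, which is the relationship the UTS46 rule keys on.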

@ricea

ricea commented Jan 13, 2023

On my computer http://example≯ looks very similar to http://example>/, which is not great. But it's probably not a good enough reason to change the status quo.

@karwa
Contributor

karwa commented Jan 13, 2023

Fundamentally, I'm not sure why the decomposition of these characters is even relevant: UTS46 normalises them to a composed form and Punycodes that, so none of these characters should ever result in naked ASCII =/</> characters being sent over the wire. I think that's all that standards such as STD3, or DNS servers, routers, etc. should care about: that the name doesn't collide with other delimiters and whatnot.

So I see no technical reason why these characters should be disallowed. And I see no non-technical reason why we should disallow characters such as ≮ and ≯ while allowing all of the following:

http://┴/ - box drawing character. Allowed => http://xn--qxh/
http://∫/ - integral symbol. Allowed => http://xn--jbh/
http://𝜢𝜠𝜰/ - Mathematical bold italic capitals. Allowed => http://xn--qxad7b/
http://𐦖.𓀡.𓀈/ - Ancient Egyptian hieroglyphics. Allowed => http://xn--6n9c.xn--3p7d.xn--ep7d/
http://helpme𓏎/ - Another hieroglyphic. U+133CE POT WITH LEGS. Allowed => http://xn--helpme-gt36b/
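The point that only Punycode output goes over the wire can be sketched with Python's built-in `punycode` codec. This is a simplification, not a ToASCII implementation: it skips the UTS46 mapping and normalisation steps and assumes each label is already in its final composed form. `punycode_label` is a hypothetical helper name.

```python
def punycode_label(label: str) -> str:
    """Encode one already-normalised label in its xn-- (A-label) form.

    Simplification: no UTS46 mapping, normalisation, or validity checks.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


# Two labels from the list above, plus the three code points under discussion.
for label in ["\u2534", "\u222B", "\u2260", "\u226E", "\u226F"]:
    encoded = punycode_label(label)
    # The encoded form never contains a bare "=", "<", or ">".
    assert not set(encoded) & set("=<>")
    print(f"{label} -> {encoded}")
```

The Punycode digits are limited to ASCII letters and digits, so the encoded label cannot introduce `=`, `<`, or `>` regardless of the input code point.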

@annevk
Member Author

annevk commented Jan 13, 2023

Thanks! I suppose this is another issue where it would be great to get input from @markusicu @macchiati.

@macchiati

These are good points. Markus, see any good reason to disallow, given that the result has to be NFC?

@markusicu

I am not vested in these three characters, or possible future ones with this behavior. Clearly the UTS46 rule is based on their Decomposition_Mapping, but UTS46 does use NFC compositions, and there are no compositions with other combining marks that could block these.

Who decides on these things? Consensus of browser makers?

For a formal request to change this, please use https://www.unicode.org/reporting.html --> UTC / Report Error in Publication/Data
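The NFC observation can be checked with a short sketch using Python's standard `unicodedata` module: a base character followed by U+0338 recomposes to the precomposed code point, so the bare ASCII base does not survive normalisation. The helper name `nfc` is only for illustration, and the check reflects the Unicode data bundled with the Python version in use.

```python
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"


def nfc(text: str) -> str:
    """Normalise text to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


for base, precomposed in [("=", "\u2260"), ("<", "\u226E"), (">", "\u226F")]:
    decomposed = base + COMBINING_LONG_SOLIDUS_OVERLAY
    # The decomposed sequence recomposes to the single precomposed code point.
    assert nfc(decomposed) == precomposed
    # The precomposed form is already stable under NFC.
    assert nfc(precomposed) == precomposed
    # No bare ASCII base character remains after composition.
    assert base not in nfc(decomposed)
```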

@annevk
Member Author

annevk commented Jan 14, 2023

Thanks, I'll file feedback on this, as well as on #543, in time for Unicode's April meeting.

In my experience of trying to make IDNA interoperable over the past decade, browsers have not been super opinionated on ToASCII. (Now ToUnicode is another matter, but that algorithm isn't directly exposed.) As long as we err on the side of compatibility, i.e., making hosts resolve, I think it should work out.

And apparently the IETF hasn't been opinionated enough either: according to a comment in that other issue, they gave up on standardizing the details of client behavior with IDNA2008. So I'm very thankful we have UTS46.

@annevk
Member Author

annevk commented Jan 16, 2023

Tentative feedback (not submitted yet):

Please change U+2260 (≠), U+226E (≮), and U+226F (≯) from disallowed_STD3_valid to valid.

These code points are not decomposed so they can never conflict with =, <, and >. And they are not inherently more confusing than any of the other allowed code points, which include hieroglyphics and emoji. These code points also work as-is in all browser engines (while < and > are forbidden) and on balance preference ought to be given to retaining compatibility so end users are not prevented from visiting websites or seeing subresources that might use these code points in their domain for one reason or another.

For further background and discussion please see https://github.com/whatwg/url/issues/733.

Thank you!

@macchiati

macchiati commented Jan 17, 2023 via email

@markusicu

tentative feedback lgtm

@annevk
Member Author

annevk commented Jan 23, 2023

Thanks, it's now submitted along with some other items, summarized in #744. I haven't yet submitted feedback on CheckBidi as I'm still not sure what to recommend. See #543.

@rmisev
Member

rmisev commented Sep 25, 2024

Please change U+2260 (≠), U+226E (≮), and U+226F (≯) from disallowed_STD3_valid to valid.

This has already been fixed in UTS 46 15.1.0, see https://www.unicode.org/reports/tr46/tr46-31.html#Modifications
So maybe this issue can be closed?

@annevk
Member Author

annevk commented Sep 25, 2024

I guess we were already testing this? If so, agreed.

@rmisev
Member

rmisev commented Sep 25, 2024

Yes, there are tests for these characters, but we test with UseSTD3ASCIIRules=false:
https://github.com/web-platform-tests/wpt/blob/a19eaaf167389a79c8971fbd25c557965541bdfd/url/resources/toascii.json#L163-L175

@annevk
Member Author

annevk commented Sep 25, 2024

That seems correct, no?

@rmisev
Member

rmisev commented Sep 26, 2024

Yes, the tests are correct.

annevk closed this as completed Sep 26, 2024