Shouldn't "\UA66D" be valid, if not a unicode sequence? #112

Open
renatoathaydes opened this issue May 7, 2020 · 3 comments

Comments

@renatoathaydes

The test test_parsing/n_string_unicode_CapitalU.json considers that "\UA66D" should fail to parse.

But I am confused by this, because the spec says that any character may be escaped, so escaping U should not be illegal. Sure, this is not supposed to be a Unicode sequence, because Unicode sequences use a lowercase u, but why shouldn't the parser accept this as the string UA66D?

@renatoathaydes

Would this also be illegal??

\Uzzzz

@renatoathaydes

renatoathaydes commented May 7, 2020

I believe the word "escaped" is being used for two different concepts:

  • escaping as in \ followed by an escaped character.
  • escaping as in using \u<unicode>.

I don't know why the same word is used in both cases, though; it's just confusing.

I assume that the first variety can only be used to escape the characters explicitly mentioned in the RFC:

    escape (
              %x22 /          ; "    quotation mark  U+0022
              %x5C /          ; \    reverse solidus U+005C
              %x2F /          ; /    solidus         U+002F
              %x62 /          ; b    backspace       U+0008
              %x66 /          ; f    form feed       U+000C
              %x6E /          ; n    line feed       U+000A
              %x72 /          ; r    carriage return U+000D
              %x74 /          ; t    tab             U+0009
              %x75 4HEXDIG )  ; uXXXX                U+XXXX

All other characters MAY be escaped using the \u notation. That I think makes sense.
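
To make that reading of the grammar concrete, here is a minimal sketch in Python that checks the backslash sequences in a string body against the escape rule quoted above. The helper name `invalid_escapes` and the regular expression are illustrative assumptions, not part of the test suite or of any library; the sketch only looks at escapes and ignores the other string rules (such as unescaped control characters).

```python
import re

# Escape rule from the grammar quoted above: a backslash followed by one of
# " \ / b f n r t, or by a lowercase u and exactly four hex digits.
_ESCAPE = re.compile(r'\\(?:["\\/bfnrt]|u[0-9A-Fa-f]{4})')


def invalid_escapes(body: str) -> list[str]:
    """Return every backslash sequence in `body` that the rule rejects.

    `body` is the raw text between the quotes of a JSON string.
    Hypothetical helper, written only for illustration.
    """
    bad = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            i += 1
            continue
        match = _ESCAPE.match(body, i)
        if match:
            # Valid escape: skip over all of it.
            i = match.end()
        else:
            # Record the backslash and the character after it (if any).
            bad.append(body[i:i + 2])
            i += 2
    return bad
```

Under this reading, `invalid_escapes(r'\UA66D')` and `invalid_escapes(r'\Uzzzz')` both report the `\U` sequence, while `invalid_escapes(r'\uA66D')` reports nothing.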

I would say that calling the \u notation "escaping" is very misleading: it's not escaping the character, it's using its encoded form... but I guess RFC authors are not known for their ability to use words unambiguously.

Hope someone can confirm my interpretation is correct.

@themobiusproject

themobiusproject commented Jun 9, 2020

I believe you are referring to:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.

The form \uhhhh can denote any character from U+0000 to U+FFFF. The form \Uhhhhhhhh, with eight hexadecimal digits, exists in some other languages (C and Python, for example) but not in JSON. In JSON, a character outside the Basic Multilingual Plane (U+10000 through U+10FFFF) is instead written as two \u escapes that encode its UTF-16 surrogate pair.
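
For reference, RFC 8259 writes a character outside the BMP as two \u escapes that encode its UTF-16 surrogate pair. A minimal sketch of that arithmetic in Python follows; the helper name `to_json_escapes` is hypothetical, and the round trip through the standard `json` module is only there as a sanity check.

```python
import json


def to_json_escapes(ch: str) -> str:
    """Write one character as JSON \\u escapes (hypothetical helper)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Inside the BMP: a single six-character escape is enough.
        return f"\\u{cp:04x}"
    # Outside the BMP: split the code point into a UTF-16 surrogate pair.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return f"\\u{high:04x}\\u{low:04x}"


# U+1D11E (musical G clef) is the non-BMP example used in RFC 8259.
escaped = to_json_escapes("\U0001D11E")
decoded = json.loads('"' + escaped + '"')
```

Here `escaped` is the twelve-character sequence `\ud834\udd1e`, and `decoded` is the original single character again.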

Getting back to your original question, \u is the escape sequence for any character.

Two more notes, the first from RFC 8259:

The representation of strings is similar to conventions used in the C family of programming languages.

The second from Wikipedia:

A sequence such as \z is not a valid escape sequence according to the C standard as it is not found in the table above.
