It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ASCII characters. Reading through the source, the following really does not look correct:
Basically, this consumes `char`s but only ever returns bytes: it converts a `char` (which represents a Unicode scalar value) directly into a `u8`, which is then interpreted as UTF-8. Outside of ASCII, though, the integer value of a `char` does not correspond to its UTF-8 encoding. (For instance, the char À has a Unicode scalar value of 192, but is encoded as 0xC3, 0x80 in UTF-8.)
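To make the mismatch concrete, here is a minimal standalone sketch. The helper names `scalar_as_byte` and `utf8_bytes` are made up for illustration and are not taken from rust-protobuf; they just contrast casting a `char` to `u8` with asking the standard library for the real UTF-8 encoding.

```rust
/// Hypothetical helper: the lossy conversion described above.
/// `as u8` keeps only the low 8 bits of the Unicode scalar value.
fn scalar_as_byte(c: char) -> u8 {
    c as u8
}

/// Hypothetical helper: the actual UTF-8 encoding of a char (1 to 4 bytes).
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For À (U+00C0), `scalar_as_byte` yields the single byte 0xC0, while `utf8_bytes` yields 0xC3, 0x80. The lone 0xC0 byte is not valid UTF-8, so `String::from_utf8` rejects it. For ASCII characters the two agree, which is presumably why the problem only shows up with non-ASCII input.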
I don't think this is hard to fix; you just need to stay in `char`s the whole time and avoid converting to bytes. Given that the text_format input is always valid UTF-8 (since you parse from `&str`, which is always valid UTF-8), it should not be possible for a string literal to ever not be valid UTF-8.
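As a rough sketch of what "staying in chars" could look like (this is not the rust-protobuf lexer and not the eventual patch; `decode_literal` is a hypothetical function that handles only a handful of simple escapes and ignores numeric ones):

```rust
/// Hypothetical helper, not part of rust-protobuf: decodes the body of a
/// quoted literal (without the surrounding quotes). Only a few simple
/// escapes are handled; numeric escapes are deliberately left out.
/// Returns None on an unknown or dangling escape.
fn decode_literal(body: &str) -> Option<String> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Push the char itself; String takes care of the UTF-8 encoding.
            out.push(c);
            continue;
        }
        // A backslash must be followed by a recognised escape character.
        match chars.next()? {
            'n' => out.push('\n'),
            't' => out.push('\t'),
            '\\' => out.push('\\'),
            '"' => out.push('"'),
            '\'' => out.push('\''),
            _ => return None,
        }
    }
    Some(out)
}
```

Because every value pushed into `out` is a `char`, the result is valid UTF-8 by construction, and something like `decode_literal("À la carte")` round-trips unchanged.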
I'm going to go ahead and write a patch for this and PR it preemptively, since I think it should be relatively trivial; will figure out a test case as well.
cmyr linked a pull request on Jun 17, 2024 that will close this issue.
Code referenced in the issue: rust-protobuf/protobuf-support/src/lexer/lexer_impl.rs, lines 443 to 479 in 16c9dc5.