Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genUnicodeString generates invalid unicode #167

Open
martyall opened this issue Sep 13, 2023 · 4 comments
Open

genUnicodeString generates invalid unicode #167

martyall opened this issue Sep 13, 2023 · 4 comments

Comments

@martyall
Copy link

martyall commented Sep 13, 2023

The unicode character generator for unicode characters is picking a random CodePoint in the BMP. The unicode string generator just generates an arbitrary array of such code points and turns it into a string. It turns out that this can generate invalid unicode via unpaired surrogates: https://unicode.org/faq/utf_bom.html#utf16-7

One solution here would be to restrict the code points to avoid such cases, another would be to figure out a more complicated but correct way to generate unicode which cannot be done CodePoint by CodePoint.

For context I discovered this while trying to write a quickcheck test for utf8 encoding/decoding, you can see the failing test here

@martyall
Copy link
Author

martyall commented Sep 13, 2023

On a related note, it appears that the unicode Char generator includes the code point 65536 (chooseInt is inlcusive?). Shouldn't this be 65535? (FYI, I've verified that this is not causing my error)

@natefaubion
Copy link
Contributor

On a related note, it appears that the unicode Char generator includes the code point 65536 (chooseInt is inlcusive?). Shouldn't this be 65535? (FYI, I've verified that this is not causing my error)

It's wrong as a bound, but 65536 is just getting turned into 65535 via toEnumWithDefaults, which is effectively clamp.

@natefaubion natefaubion changed the title Generators generate invalid unicode genUnicodeString generates invalid unicode Sep 13, 2023
@martyall
Copy link
Author

martyall commented Sep 13, 2023

Does something like this seem right?

https://github.com/f-o-a-m/purescript-bytestrings/pull/1/files#diff-71732b478b4808898d86c8591ad7ab46d8122c1e4facec4a9151ac49efba905dR107-R137

At least it passes the utf-8 round trip

@garyb
Copy link
Member

garyb commented Sep 24, 2023

Seems reasonable to me 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants