Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Unicode identifiers in GDScript and Expression #71676

Merged
merged 2 commits into from
Jan 23, 2023

Conversation

vnen
Copy link
Member

@vnen vnen commented Jan 19, 2023

This is using an adapted version of UAX#31 to not rely on the ICU database (which isn't available in builds without TextServerAdvanced). It allows most characters used in diverse scripts but not everything.

This is based on the previous work by @bruvzg on #53956.

This depends highly on #71598, otherwise it might show statements in a misleading order when there are RTL words.

gdscript-unicode-ids

Confusable identifiers: There are two checks for confusable identifiers using the methods of TextServer:

  1. Using spoof_check() which checks for mixed characters that can be confusing. E.g. var pοrt, which uses Greek omicron instead of Latin o). This is checked mostly on declarations and only gives a warning.
  2. Using is_confusable() against the list of GDScript keywords which checks for visual similarities. E.g. аs, which uses Cyrillic а). This gives an error.

Those checks only work properly if TextServerAdvanced is available. It is by default on official builds, but may be missing on custom builds (in such case they use the fallback noop from the basic TextServer).

Known issues:

  • While spoof_check() gives a warning, visually similar identifiers are treated as different names.
  • There are visually similar pairs that won't be able to have any warning if they don't use mixed scripts. E.g ç and (first uses regular "ç" character, that other uses a combining sequence "c+◌̧"). Also: and Å (first is \u212B, second is \u00C5).
    • One partial solution to this would be forbidding composition characters, except ones that never appears pre-composed. It would fix the first example but not the second. It could also create some hurdles for languages that uses lots of accents.
    • The proper solution would be to perform Unicode normalization. This would make the pairs in the examples become the same identifier, even if using a different codepoint combination. The problem is that this also requires ICU data.
  • Visually similar identifiers won't be treated as the same nor give any error or warning (if they pass spoof_check()).
    • We could run the is_confusable() check against the list of previously declared identifiers, but that can become quite big overtime, especially if we want to compare with engine classes, global script classes, and singleton autoloads, not to mention the inheritance tree of the current script.
    • I'm am unaware of the full security implications this might have.
  • I used 𝛑 on the screenshot as an example and that creates a confusable identifier warning. I guess this is the mathematical symbol, not the Greek letter, but I'm not sure why a single character would be problematic. This is from the TextServerAdvanced implementation, so any needed change it would be done there, though this is probably accurate to Unicode specification.
  • The Hebrew word מִבְחָן (I just used Google to translate test) also gives the confusable warning. I'm not familiar with Hebrew so can't tell for sure if there's an actual issue or if it's a false positive.
  • The simplified nature of the check might make some valid words not valid identifiers. In particular scripts that needs the ZWNJ (zero-width non-joiner) or ZWJ (zero-width) characters to display words correctly, since those aren't allowed. The UAX#31 has specific rules for it, which are used in TextServerAdvanced::is_valid_identifier(), but those are quite complex to apply, and also needs the ICU data.

Closes godotengine/godot-proposals#916

Special thanks to @bruvzg who did most of the groundwork and helped me out with this.

@vnen vnen force-pushed the gdscript-unicode-identifiers branch 3 times, most recently from 7672caf to d2af622 Compare January 21, 2023 15:13
@vnen vnen marked this pull request as ready for review January 21, 2023 15:13
@vnen vnen requested review from a team as code owners January 21, 2023 15:13
@vnen
Copy link
Member Author

vnen commented Jan 21, 2023

For what I can tell is already a good improvement so I'm marking as ready to review and merge. We can gather user feedback to see what are the hurdles they find when using this feature on actual projects.

vnen added 2 commits January 21, 2023 13:39
This is using an adapted version of UAX#31 to not rely on the ICU
database (which isn't available in builds without TextServerAdvanced).
It allows most characters used in diverse scripts but not everything.
@vnen vnen force-pushed the gdscript-unicode-identifiers branch from d2af622 to 86ee5f3 Compare January 21, 2023 16:40
@akien-mga akien-mga merged commit 5726bf5 into godotengine:master Jan 23, 2023
@akien-mga
Copy link
Member

Thanks! 🎉

@vnen vnen deleted the gdscript-unicode-identifiers branch January 23, 2023 12:46
@baptr
Copy link
Contributor

baptr commented Jan 29, 2023

7548e04#diff-0e70e5b10ffb6bcbf01a9c92c0ec2b948011460a2d4ce20ab3443ffbbf9d2de8R588-R595 makes beta16 pretty annoying to use if you previously had a player enum value like P1 (now declared too similar to the built-in PI).

The debug error bypasses any project options, and seems to ~silently block recompilation.

HolonProduction added a commit to HolonProduction/godot-gdscript-toolkit that referenced this pull request Feb 7, 2023
HolonProduction added a commit to HolonProduction/godot-gdscript-toolkit that referenced this pull request Feb 8, 2023
HolonProduction added a commit to HolonProduction/godot-gdscript-toolkit that referenced this pull request Feb 8, 2023
Scony pushed a commit to Scony/godot-gdscript-toolkit that referenced this pull request Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow Unicode characters in GDScript identifiers
5 participants