Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid regexes as detected by GitHub's grammar compiler #99

Closed
4 tasks
lildude opened this issue Feb 21, 2023 · 7 comments
Closed
4 tasks

Invalid regexes as detected by GitHub's grammar compiler #99

lildude opened this issue Feb 21, 2023 · 7 comments

Comments

@lildude
Copy link

lildude commented Feb 21, 2023

Describe what you see, what you want to see, and perhaps some linkage to docs, synopses, or irclog chatter.

👋 I'm the lead maintainer of the https://github.com/github/linguist library which is used for language detection and providing the syntax highlighting for languages on GitHub.com, and we use this grammar.

Our grammar compiler has found a problem with your grammar which I thought I'd let you know about.

- [ ] repository `vendor/grammars/atom-language-perl6` (from https://github.com/perl6/atom-language-perl6) (6 errors)
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.cson`) contains a malformed regex (regex "`(?x) ( [\p{Digit}\p{Alpha}'\-_]+`...": unknown property name after \P or \p (at offset 16))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.cson`) contains a malformed regex (regex "`[\p{Digit}\p{Alpha}'\-_]+`": unknown property name after \P or \p (at offset 9))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.cson`) contains a malformed regex (regex "`(?x)(\$|@|%|&)(\.|\*|:|!|\^|~|`...": unknown property name after \P or \p (at offset 57))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.cson`) contains a malformed regex (regex "`(?x)(?<!\\)(\$|@|%|&)(?!\$)(`...": unknown property name after \P or \p (at offset 75))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.cson`) contains a malformed regex (regex "`(`": missing ) (at offset 1))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.cson`) contains a malformed regex (regex "`)`": unmatched parentheses (at offset 0))

GitHub uses PCRE instead of Ruby's Oniguruma regex engine for performance reasons which means some of your regexes aren't interpretted, in this case the \p character classes.

Example Code

N/A

Picture [optional]

Providing a picture means that if the issue is fixed and linguist updates to include the fix (linguist uses this package to highlight Raku on GitHub), your issue will remain historically viewable.

Leave this in. For internal use.

  • Fixed in Master
  • Fixed in Release
  • Has Tests
  • Passes Tests
@2colours
Copy link
Collaborator

Hi @lildude ,

if I understand correctly, since this highlighter was designed with Oniguruma in mind, this is not really an issue with the content per se.

I'm fairly recent to take up on this topic so would you mind if I ask: how can one describe the same content with PCRE? If it's not possible, the change would have questionable value.
Thanks.

@lildude
Copy link
Author

lildude commented Apr 20, 2023

https://www.regular-expressions.info/unicode.html#category is a great resource for this.

@2colours
Copy link
Collaborator

2colours commented Jul 3, 2023

A follow-up:

Alpha = L(etter) or M(ark)
Digit = Nd (decimal number)

Hopefully they would work with Oniguruma as well; honestly, I had to look up what a mark approximately is in the first place. 😅

raiph added a commit to raiph/atom-language that referenced this issue Jul 4, 2023
NOT TESTED!!!

Per Raku#99, Github's regex validator does not accept Oniguruma regexes. There's
some escaping/filtering going on that makes the report hard to interpret but I'm pretty sure I figured it out. (I think it's just some `\\p{Alpha}`s that needed translating to `\\pL\\pM`, and a `"("` and `")"` pair of strings that needed escaping of the parens).
@raiph
Copy link

raiph commented Jul 4, 2023

Anyone reading this, ignore the specifics listed in the opening comment. Instead, click Outstanding Grammar Issues to view the latest Raku errors, if any. At the time of writing this comment the file with errors in it has changed its name to raku.tmLanguage.json and it looks like there are just two errors:

  • Several uses of \p{Alpha}.

  • A "(" string and a ")" string that are presumably also being treated as regexes.

I've PR'd an edit of the file that I think (not tested!) does the minimal change of the things Github is saying are invalid.

I'm actually suspicious the \p{Alpha} was a bit "off" because a simple interpretation of what I read it was suggests it would have picked up a sequence that begins with a mark rather than only marks that follow an initial letter. But I'm not going to investigate that right now.

@2colours
Copy link
Collaborator

2colours commented Aug 3, 2023

#103 is out now; hopefully we get to see the result of it soon.

@2colours
Copy link
Collaborator

github-linguist/linguist#3924 (comment) seems like we are good now, as far as the syntax goes. Closing this issue.

@raiph
Copy link

raiph commented Aug 14, 2023

Ugh. It looks like I misinterpreted the column count of some of the error messages (the column count was ambiguously pointed at the t of \p{Digit} or \ of \p{Alpha} and I thought, it seems incorrectly, it was the latter) and I also didn't understand your comment about Nd (which ironically is why I took a look in the first place, in an attempt to understand what the error message and your comment might have meant).

Anyway, thanks for fixing that and shepherding this to a successful conclusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants