-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(chore) Clean up all regexs to be UTF-8 compliant/ready #2759
Conversation
b6229d4
to
c50c5b5
Compare
No rush, I'm pushing this off until 10.4. 10.3 is big enough. |
@@ -22,7 +22,7 @@ export default function(hljs) { | |||
}, | |||
// YAML block | |||
{ | |||
begin: '(\s+)?---$', end: '\\.\\.\\.$', | |||
begin: /---$/, end: '\\.\\.\\.$', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was broken before \s
vs \\s
and when fixed seemed to break things so without any context going to go with the simpler rule for now.
className: 'string', | ||
begin: /0\'\\s/ // 0'\s | ||
begin: /0'\\s/ // 0'\s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is greek to me, but I think it's correct... anyone know Prolog? The "\s" is literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me now I think. Code might be:
% \s literal (escaped)
0'\s
% character "b"
0'b
% character "'" (escaped)
0'\'
I'm assuming 0''
is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, looks like 0'
is the way of getting the character code of a char. So 0'\s
returns 32
. And 0''
is valid, it returns 39
Screenshot and confirmation courtesy of @Adwitiya-Singh
More info, https://www.swi-prolog.org/pldoc/man?section=charescapes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you are correct, 0'' is valid, it returns 39
Valid or invalid? Although I suppose the regex currently allows it either way, LOL... and I'm not even sure if that is bad or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I misread your original comment 😅 0''
is valid as seen in the screenshot. But since the regex allows it, works for me
@@ -144,7 +144,7 @@ export default function(hljs) { | |||
}, //*/ | |||
|
|||
{ | |||
begin: '\\b(' + COMMON_COMMANDS.split(' ').join('|') + ')([\\s\[\(]|\])', | |||
begin: '\\b(' + COMMON_COMMANDS.split(' ').join('|') + ')([\\s[(\\]|])', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I broke this? Not sure what it's supposed to be doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure it's fine, added a comment to explain it.
@@ -31,29 +31,29 @@ export default function(hljs) { | |||
className: 'literal', | |||
variants: [ | |||
{ | |||
begin: '\\b(?:PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\\b' | |||
begin: '\\b(PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\\b' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are speed optimizations believe it or not the benchmark was faster without the ":" to mark it as a non capturing group.
src/highlight.js
Outdated
@@ -579,12 +579,17 @@ const HLJS = function(hljs) { | |||
@param {Array<string>} [languageSubset] | |||
@returns {AutoHighlightResult} | |||
*/ | |||
let ts = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be reverted before merge. This whole PR changes nothing in the core library (other than grammars).
I've 99% review this myself but could still use another set of eyes. :) |
bdfb54f
to
3f06e89
Compare
Co-authored-by: Vladimir Jimenez <allejo@me.com>
Anything else? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just don't forget to revert this but this looks good to me
Work toward #2756.
Cleans up a lot of incorrect (unnecessary escaped) regex and would not compile with the
u
flag. After that makes some rather large performance improvements (with utf8 turned on at least) toyaml
andmipsasm
. It looks like the mipasm rules have been wrong all alone... as far as I can determine they are intended to match a literal.
(otherwise they are far too broad) but were matching any character - which seems to terribly slow down the whole grammar inu
mode.This consisted mostly of:
{
and}
-
,<
,>
, and others.I plan to review the PR myself line by line, but can't imagine it'll be fun. I did try to change the minimal amount necessary. Often I turned strings into regex if it made them easier to read and see what I was doing. Once in a while I touched a nearby regex.
I imagine the fact that all tests still pass is a pretty good indication this is 99% correct. :-)
Note: This doesn't actually turn on UTF8 anywhere... it just fixes all the regex so that if
u
is added inside the main mode compiler everything "just works". It is still needed to be reviewed what else might need to be done on the road to UTF8 support.