(chore) Clean up all regexs to be UTF-8 compliant/ready #2759

joshgoebel · 2020-10-16T05:59:03Z

Work toward #2756.

Cleans up a lot of incorrect (unnecessarily escaped) regexes that would not compile with the u flag. After that, it makes some rather large performance improvements (with UTF-8 turned on, at least) to yaml and mipsasm. It looks like the mipsasm rules have been wrong all along... as far as I can determine they are intended to match a literal . (otherwise they are far too broad) but were matching any character, which seems to terribly slow down the whole grammar in u mode.

This consisted mostly of:

Mostly unescaped { and }
Lots of unneeded escapes for -, <, >, and others.
Converting strings to regex if it made them simpler, easier to read (editor syntax coloring)
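
To make the categories above concrete, here is a minimal sketch of why such patterns need cleaning up before the u flag can be added. The helper name `compilesWithU` and the sample patterns are hypothetical (not taken from any grammar in this PR); the only assumption is standard JavaScript `RegExp` behavior, where Unicode mode rejects unneeded escapes and stray braces that the default mode tolerates.

```javascript
// Hypothetical helper (not part of highlight.js): report whether a pattern
// source compiles when the `u` flag is added.
function compilesWithU(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    return false; // SyntaxError: the pattern is not valid in Unicode mode
  }
}

// Illustrative patterns only; none are copied from a grammar.
const samples = [
  String.raw`a\-b`,    // unneeded escape of "-" outside a character class
  String.raw`\<tag\>`, // unneeded escapes of "<" and ">"
  String.raw`x{`,      // unescaped "{" that is not a quantifier
  String.raw`a-b`,     // cleaned up: no escape needed
  String.raw`<tag>`,   // cleaned up: no escapes needed
  String.raw`x\{`,     // cleaned up: "{" is escaped
];

for (const source of samples) {
  console.log(source.padEnd(10), compilesWithU(source) ? 'ok' : 'rejected');
}
```

All six samples compile without the u flag, which is why these problems went unnoticed until Unicode mode was tried.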

I plan to review the PR myself line by line, but can't imagine it'll be fun. I did try to change the minimal amount necessary. Often I turned strings into regex if it made them easier to read and see what I was doing. Once in a while I touched a nearby regex.

I imagine the fact that all tests still pass is a pretty good indication this is 99% correct. :-)

Note: This doesn't actually turn on UTF-8 anywhere... it just fixes all the regexes so that if u is added inside the main mode compiler everything "just works". What else might be needed on the road to UTF-8 support still has to be reviewed.

joshgoebel · 2020-10-16T15:58:49Z

No rush, I'm pushing this off until 10.4. 10.3 is big enough.

src/languages/qml.js

joshgoebel · 2020-10-20T02:45:14Z

src/languages/tap.js

@@ -22,7 +22,7 @@ export default function(hljs) {
      },
      // YAML block
      {
-        begin: '(\s+)?---$', end: '\\.\\.\\.$',
+        begin: /---$/, end: '\\.\\.\\.$',


This was broken before (\s vs \\s), and fixing it seemed to break things, so without any more context I'm going with the simpler rule for now.
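
A small sketch of the \s vs \\s problem, assuming only standard JavaScript string-literal rules: inside a quoted string an unknown escape such as `\s` silently drops the backslash, so the old rule never matched whitespace at all. The variable names and the sample line are made up for illustration.

```javascript
// In a string literal, `\s` is just the letter "s"; `\\s` is needed to get
// the regex whitespace class.
const oldBegin = new RegExp('(\s+)?---$');  // what the grammar had
const intended = new RegExp('(\\s+)?---$'); // what was probably meant
const newBegin = /---$/;                    // the simpler rule from the diff

console.log(oldBegin.source); // (s+)?---$
console.log(intended.source); // (\s+)?---$

// Hypothetical sample line with leading indentation.
const line = '  ---';
console.log(JSON.stringify(oldBegin.exec(line)[0])); // "---"
console.log(JSON.stringify(intended.exec(line)[0])); // "  ---"
console.log(JSON.stringify(newBegin.exec(line)[0])); // "---"
```

For this sample the simpler rule matches the same text as the old (broken) rule, whereas the "fixed" version would also swallow the leading whitespace.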

src/languages/dsconfig.js

src/languages/mipsasm.js

joshgoebel · 2020-10-20T02:54:18Z

src/languages/prolog.js

    className: 'string',
-    begin: /0\'\\s/ // 0'\s
+    begin: /0'\\s/ // 0'\s


This is greek to me, but I think it's correct... anyone know Prolog? The "\s" is literal?

This makes sense to me now I think. Code might be:

    % \s literal (escaped)
    0'\s
    % character "b"
    0'b
    % character "'" (escaped)
    0'\'

I'm assuming 0'' is invalid.

Yea, looks like 0' is the way of getting the character code of a char. So 0'\s returns 32. And 0'' is valid, it returns 39

Screenshot and confirmation courtesy of @Adwitiya-Singh

More info, https://www.swi-prolog.org/pldoc/man?section=charescapes

And you are correct, 0'' is valid, it returns 39

Valid or invalid? Although I suppose the regex currently allows it either way, LOL... and I'm not even sure if that is bad or not.

My bad, I misread your original comment 😅 0'' is valid as seen in the screenshot. But since the regex allows it, works for me
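
For reference, a minimal sketch of what the two versions of the rule do, assuming standard JavaScript `RegExp` semantics. The pattern sources are the ones from the diff above; the Prolog sample line and the helper `compiles` are made up for illustration.

```javascript
// Pattern sources from the diff, written with String.raw so the backslashes
// appear exactly as in the regex literals.
const before = String.raw`0\'\\s`; // original: /0\'\\s/
const after = String.raw`0'\\s`;   // cleaned up: /0'\\s/

// Hypothetical line of Prolog containing the literal text 0'\s
const sample = String.raw`X = 0'\s.`;

// Hypothetical helper: does the source compile with the given flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(new RegExp(before).test(sample)); // true
console.log(new RegExp(after).test(sample));  // true
console.log(compiles(before, 'u'));           // false: \' is not a valid escape in u mode
console.log(compiles(after, 'u'));            // true
```

So the "\s" is matched literally (a backslash followed by s), and dropping the escape on the quote changes nothing except that the pattern now compiles in u mode.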

src/languages/yaml.js

joshgoebel · 2020-10-20T02:58:27Z

src/languages/routeros.js

@@ -144,7 +144,7 @@ export default function(hljs) {
      }, //*/

      {
-        begin: '\\b(' + COMMON_COMMANDS.split(' ').join('|') + ')([\\s\[\(]|\])',
+        begin: '\\b(' + COMMON_COMMANDS.split(' ').join('|') + ')([\\s[(\\]|])',


Maybe I broke this? Not sure what it's supposed to be doing.

Pretty sure it's fine, added a comment to explain it.
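
A quick way to see what the trailing part of this rule becomes in each version, as a sketch under standard JavaScript string and `RegExp` rules. Only the tail of the pattern is reproduced; `COMMON_COMMANDS` is left out, and the variable and helper names are made up.

```javascript
// Tail of the `begin` string before and after the change, copied from the diff.
const oldTail = '([\\s\[\(]|\])';
const newTail = '([\\s[(\\]|])';

// What the regex engine actually receives once string escapes are resolved.
console.log(oldTail); // ([\s[(]|])   -> class [\s[(] OR a literal ]
console.log(newTail); // ([\s[(\]|])  -> a single class: \s [ ( ] |

// Hypothetical helper: does the source compile with the given flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(compiles(oldTail, 'u')); // false: the lone ] is rejected in u mode
console.log(compiles(newTail, 'u')); // true

for (const ch of [' ', '[', '(', ']', '|']) {
  console.log(JSON.stringify(ch), new RegExp(oldTail).test(ch), new RegExp(newTail).test(ch));
}
```

Both versions accept whitespace, `[`, `(` and `]`. One difference this sketch surfaces is that the new character class also accepts a literal `|`; whether that matters for real RouterOS input is not something this example settles.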

joshgoebel · 2020-10-20T03:01:32Z

src/languages/lsl.js

@@ -31,29 +31,29 @@ export default function(hljs) {
        className: 'literal',
        variants: [
            {
-                begin: '\\b(?:PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\\b'
+                begin: '\\b(PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\\b'


This is a speed optimization: believe it or not, the benchmark was faster without the "?:" that marks it as a non-capturing group.
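
As a sanity check that dropping the non-capturing marker does not change what is matched, here is a small sketch using the two patterns from the diff. The LSL-like sample line is made up, and no timing is measured here since that depends on the engine and the benchmark.

```javascript
const nonCapturing = /\b(?:PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\b/;
const capturing = /\b(PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\b/;

// Hypothetical sample line.
const sample = 'float angle = 90.0 * DEG_TO_RAD;';

const a = nonCapturing.exec(sample);
const b = capturing.exec(sample);

console.log(a[0], a.length);       // DEG_TO_RAD 1
console.log(b[0], b.length, b[1]); // DEG_TO_RAD 2 DEG_TO_RAD
console.log(a.index === b.index);  // true
```

The matched text and position are identical; the only visible difference is the extra capture slot in the result.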

src/languages/mipsasm.js

src/lib/mode_compiler.js

joshgoebel · 2020-10-25T20:56:48Z

src/highlight.js

@@ -579,12 +579,17 @@ const HLJS = function(hljs) {
    @param {Array<string>} [languageSubset]
    @returns {AutoHighlightResult}
  */
+    let ts = {};


This will be reverted before merge. This whole PR changes nothing in the core library (other than grammars).

joshgoebel · 2020-10-25T20:57:40Z

I've 99% reviewed this myself but could still use another set of eyes. :)

src/languages/ebnf.js

src/languages/gams.js

src/languages/javascript.js

src/languages/less.js

src/languages/lisp.js

src/languages/parser3.js

src/languages/routeros.js

src/languages/stylus.js

src/languages/xquery.js

src/languages/yaml.js

Co-authored-by: Vladimir Jimenez <allejo@me.com>

joshgoebel · 2020-11-02T23:13:36Z

Anything else? :)

allejo

Just don't forget to revert this but this looks good to me

This reverts commit b8bf164.

joshgoebel added this to the 10.4 milestone Oct 16, 2020

joshgoebel force-pushed the utf8_support branch from b6229d4 to c50c5b5 Compare October 16, 2020 15:51

joshgoebel requested review from allejo and egor-rogov October 16, 2020 15:57

joshgoebel mentioned this pull request Oct 16, 2020 in (parser) Improper tokenization with variable names in languages that use Unicode / UTF-8 #2756 (open, 21 tasks)

joshgoebel added the hacktoberfest-accepted label Oct 16, 2020