Add `baseLangage` to language grammars #2275

joshgoebel · 2019-11-12T06:43:19Z

From my tweaked language checker:

┌────────────────────┬────────────────────┬──────────┬────────────────────┬──────────┬────────────────────┐
│ expected           │ actual             │ score    │ 2nd best           │ score    │ info               │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ cpp                │ cpp                │ 16       │ arduino            │ 16       │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ cpp                │ cpp                │ 25       │ arduino            │ 25       │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ xml                │ xml                │ 46       │ django             │ 46       │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ xml                │ xml                │ 7        │ django             │ 7        │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ xml                │ xml                │ 3        │ django             │ 3        │ Relevance match.   │

These aren't surprising, but also aren't very helpful. They would be reported as errors except for the load order that's enforced by the requires... and the fact that we magically prefer the "first match" even if the scores are the same.

I'd like to add an explicit key to explain these relationships to the parser.

So that Arduino would be a superset: 'cpp' and Django would be a superset: 'xml'. I'm open to better naming. If two languages are detected to be related in this way then the auto-detect would simply ignore the superset in cases of EQUAL matches, only preferred it if it had a HIGHER match (which is what you'd expect if it was the superset, and not the original).

Also if you're looking to correct an incorrect match it's not helpful to know the second match to cpp was Arduino, they are essentially the same (not going to color differently, etc). It would be useful to know the second match (OTHER than Arduino) was say Pascal.

That's the change I'm suggested here if no one is opposed. Should only be a few lines of code and adding the appropriate attribute to a few grammars.

The text was updated successfully, but these errors were encountered:

joshgoebel · 2019-11-12T07:15:09Z

And when we're thinking about naming it might be helpful to keep things like CoffeeScript and LiveScript in mind also:

LiveScript is a fork of Coco and an indirect descendant of CoffeeScript, with which it has much compatibility.

This is another tie:

│ coffeescript │ coffeescript │ 26 │ livescript │ 26 │ Relevance match. │

We're only saved from error because of alphabetical ordering. But our CoffeeScript snippet actually compiles as-is to LiveScript... so really I think the same type of relationship could apply here. If the parser gets "confused" about the difference between a snippet of CoffeeScript and LiveScript it should be smart enough to know that's "OK".

Perhaps:

LiveScript verySimilarTo: 'livescript'

And that would allow matches to be generally interchangeable between the two. It's also possible the tests themselves might be a better place to encode "similar to", but then when I think about seamless integration with 3rd party grammars (perhaps including running a much larger set of auto detect tests) I lean a little towards embedding this in the grammar itself (since it's so small a detail).

egor-rogov · 2019-11-13T00:02:22Z

If two languages are detected to be related in this way then the auto-detect would simply ignore the superset in cases of EQUAL matches, only preferred it if it had a HIGHER match (which is what you'd expect if it was the superset, and not the original).

Hmm... C++ is a superset for Arduino (Arduino adds some new keywords to C++). Hence, Arduino relevance match is always greater or equal to C++ relevance match, for any code snippet. Hence, in case of equal scores you should assume C++ (superset), not Arduino. Am I wrong?

joshgoebel · 2019-11-13T02:33:23Z

Yes. That’s why I’m saying the specialized case should be safe to ignore if none of it’s rules result in a more specialized match. Though I’m really mixing two things here a bit.

First: There is autodetect in a very general/philosophical sense. “This short code snippet looks like C++ but it also looks very much like Arduino. It could possibly be either.” I suppos sometimes that might be useful. But really if it has no Arduino-isms is it Ardunio code in a meaningful sense for us? I think the same thinking applies to XML and Django.

If this first level is actually useful from an API standpoint I am more than happy to think about how to preserve this.

Secondly: I’m trying to make the auto detect/tests a lot smarter about actual “conflicts” between languages. If you have a matching relevance score for C++ or Arduino that’s NOT a real problem. It honestly doesn’t matter which won. It should be expected and not result in test failure.

Meanwhile, a matching relevance score between say JavaScript and SQL might actually be a big problem - poor auto detection, etc. those are the things I want to get better at focusing on.

egor-rogov · 2019-11-20T22:47:10Z

I think I'm a bit lost.

First: There is autodetect in a very general/philosophical sense. “This short code snippet looks like C++ but it also looks very much like Arduino. It could possibly be either.” I suppos sometimes that might be useful. But really if it has no Arduino-isms is it Ardunio code in a meaningful sense for us?

For me, the answer is "No, it is not Arduino, give me C++ please." But, as you mentioned:

If you have a matching relevance score for C++ or Arduino that’s NOT a real problem. It honestly doesn’t matter which won.

So, what is superset for then?

Meanwhile, a matching relevance score between say JavaScript and SQL might actually be a big problem - poor auto detection, etc. those are the things I want to get better at focusing on.

Right. And how superset can help?

joshgoebel · 2019-11-20T22:55:19Z

So, what is superset for then?

To resolve the ambiguity of cases like C++ score:20 vs Arduino score:20. Whichever comes last or first alphabetically is a pretty terrible way to do it (and that's what we do now). Our tests passing or failing should NOT depend solely on the ordering of the data or naming of the languages.

If we decided to rename "XML" to "HTML" right now a whole bunch of tests might suddenly fail based on the renaming alone. This should not be the case.

Right. And how superset can help?

It's not designed to solve that problem. One thing at a time.

egor-rogov · 2019-11-20T23:00:16Z

Aha. Ok then. It makes sense.

joshgoebel · 2019-11-20T23:01:08Z

We could also call it requires or dependsOn and express it that way... and then the "base" class would always win in cases of exact matches... I told you I wasn't sure on the naming. :-)

egor-rogov · 2019-11-21T23:25:35Z

I personally have no problems with superset, but it requires a bit of mathematical background, so probably another name will serve better. Both requires and depensOn don't look good to me, because they tell about implementation details, not about relationship between languages. base or baseLang seems better in that respect. Also parent comes to mind.

joshgoebel · 2019-11-22T12:10:54Z

I like baseLanguage.

joshgoebel · 2019-11-22T12:23:23Z

@egor-rogov How would you feel if as part of this we dinged a language instead of just removing it... So I'd add a scoredMatches key (or similar) and you would have a list of languages you could introspect. We could also add a base: true key or similar (for the results) to let someone process them as made the most sense to them... this would pretty easily allow for those use cases where someone might actually want to know "This could be Arduino or CPP". (maybe for a UI or something)

[
{ score: 46, language: "XML", ... }
{ score: 45.9, language: "Django", base: "XML", ... } // been dinged because it's not a base language
{ score: 45.9, language: "Handlebars", base: "XML", ... } // been dinged because it's not a base language
{ score: 45.9, language: "ERB", base: "XML",... } // been dinged because it's not a base language
]

The existing top match/second match would just become the first two items from this new list.

egor-rogov · 2019-11-29T18:53:20Z

I didn't quite understand how you're going to use it, but it's fine with me.

- *technically* exact relevance ties in auto-detect are bad - HLJS resolves ties in _alphabetic order_ and the fact that they currently work is more a matter of luck than anything else - this now bubbles up such issues when the checker tool is run Eventually something like highlightjs#2275 should improve matters.

- Entire library now no longer uses any run-time dependencies (no grammar will break because another is missing) - Add docs for fixMarkup deprecation - Add docs for requireLanguage deprecation - Remove `Requires:` meta-data used only for run-time dependencies. - Gzip size is only ~230 bytes larger than before - Fix auto-detect tests now that the equivalency between `cpp` and `arduino` is no longer hidden by the load order. - adds `supersetOf` to resolve conflicts between very similar languages (C++, Arduino). The base language wins in the case of a tie. ie `arduino` is now a `supersetOf` `cpp`. (closes #2275) Note: `Requires: ` still stays overall as it's used to clue the build process about "loose" (non-breaking) sublanguage dependencies between languages. This dependency is purposely loose and therefore still a thing.

joshgoebel added parser auto-detect Issue with auto detection of language type enhancement An enhancement or new feature labels Nov 12, 2019

joshgoebel changed the title ~~Add superset to language grammar~~ Add baseLangage to language grammars Nov 22, 2019

joshgoebel mentioned this issue Dec 6, 2019

Version 10 Breaking Changes #2273

Closed

15 tasks

joshgoebel added this to the 10.0 milestone Dec 7, 2019

joshgoebel modified the milestones: 10.0, 10.1 Dec 24, 2019

joshgoebel mentioned this issue Jan 31, 2020

Freezing issue with coffeescript regex under some circumstances scrivo/highlight.php#66

Closed

joshgoebel removed this from the 10.1 milestone Apr 21, 2020

joshgoebel added the discuss/propose Proposal for a new feature/direction label Apr 30, 2020

joshgoebel mentioned this issue Oct 18, 2020

Language classification should be configurable via plugin #2768

Open

joshgoebel mentioned this issue Nov 11, 2020

(chore) Deprecate the requireLanguage API #2844

Merged

4 tasks

joshgoebel closed this as completed in #2844 Nov 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `baseLangage` to language grammars #2275

Add `baseLangage` to language grammars #2275

joshgoebel commented Nov 12, 2019 •

edited

Loading

joshgoebel commented Nov 12, 2019 •

edited

Loading

egor-rogov commented Nov 13, 2019

joshgoebel commented Nov 13, 2019 •

edited

Loading

egor-rogov commented Nov 20, 2019

joshgoebel commented Nov 20, 2019 •

edited

Loading

egor-rogov commented Nov 20, 2019

joshgoebel commented Nov 20, 2019

egor-rogov commented Nov 21, 2019 •

edited

Loading

joshgoebel commented Nov 22, 2019

joshgoebel commented Nov 22, 2019 •

edited

Loading

egor-rogov commented Nov 29, 2019

Add baseLangage to language grammars #2275

Add baseLangage to language grammars #2275

Comments

joshgoebel commented Nov 12, 2019 • edited Loading

joshgoebel commented Nov 12, 2019 • edited Loading

egor-rogov commented Nov 13, 2019

joshgoebel commented Nov 13, 2019 • edited Loading

egor-rogov commented Nov 20, 2019

joshgoebel commented Nov 20, 2019 • edited Loading

egor-rogov commented Nov 20, 2019

joshgoebel commented Nov 20, 2019

egor-rogov commented Nov 21, 2019 • edited Loading

joshgoebel commented Nov 22, 2019

joshgoebel commented Nov 22, 2019 • edited Loading

egor-rogov commented Nov 29, 2019

Add `baseLangage` to language grammars #2275

Add `baseLangage` to language grammars #2275

joshgoebel commented Nov 12, 2019 •

edited

Loading

joshgoebel commented Nov 12, 2019 •

edited

Loading

joshgoebel commented Nov 13, 2019 •

edited

Loading

joshgoebel commented Nov 20, 2019 •

edited

Loading

egor-rogov commented Nov 21, 2019 •

edited

Loading

joshgoebel commented Nov 22, 2019 •

edited

Loading