-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add baseLangage
to language grammars
#2275
Comments
And when we're thinking about naming it might be helpful to keep things like CoffeeScript and LiveScript in mind also:
This is another tie:
We're only saved from error because of alphabetical ordering. But our CoffeeScript snippet actually compiles as-is to LiveScript... so really I think the same type of relationship could apply here. If the parser gets "confused" about the difference between a snippet of CoffeeScript and LiveScript it should be smart enough to know that's "OK". Perhaps: LiveScript And that would allow matches to be generally interchangeable between the two. It's also possible the tests themselves might be a better place to encode "similar to", but then when I think about seamless integration with 3rd party grammars (perhaps including running a much larger set of auto detect tests) I lean a little towards embedding this in the grammar itself (since it's so small a detail). |
Hmm... C++ is a superset for Arduino (Arduino adds some new keywords to C++). Hence, Arduino relevance match is always greater or equal to C++ relevance match, for any code snippet. Hence, in case of equal scores you should assume C++ (superset), not Arduino. Am I wrong? |
Yes. That’s why I’m saying the specialized case should be safe to ignore if none of it’s rules result in a more specialized match. Though I’m really mixing two things here a bit. First: There is autodetect in a very general/philosophical sense. “This short code snippet looks like C++ but it also looks very much like Arduino. It could possibly be either.” I suppos sometimes that might be useful. But really if it has no Arduino-isms is it Ardunio code in a meaningful sense for us? I think the same thinking applies to XML and Django. If this first level is actually useful from an API standpoint I am more than happy to think about how to preserve this. Secondly: I’m trying to make the auto detect/tests a lot smarter about actual “conflicts” between languages. If you have a matching relevance score for C++ or Arduino that’s NOT a real problem. It honestly doesn’t matter which won. It should be expected and not result in test failure. Meanwhile, a matching relevance score between say JavaScript and SQL might actually be a big problem - poor auto detection, etc. those are the things I want to get better at focusing on. |
I think I'm a bit lost.
For me, the answer is "No, it is not Arduino, give me C++ please." But, as you mentioned:
So, what is
Right. And how |
To resolve the ambiguity of cases like C++ score:20 vs Arduino score:20. Whichever comes last or first alphabetically is a pretty terrible way to do it (and that's what we do now). Our tests passing or failing should NOT depend solely on the ordering of the data or naming of the languages. If we decided to rename "XML" to "HTML" right now a whole bunch of tests might suddenly fail based on the renaming alone. This should not be the case.
It's not designed to solve that problem. One thing at a time. |
Aha. Ok then. It makes sense. |
We could also call it |
I personally have no problems with |
I like |
superset
to language grammarbaseLangage
to language grammars
@egor-rogov How would you feel if as part of this we dinged a language instead of just removing it... So I'd add a [
{ score: 46, language: "XML", ... }
{ score: 45.9, language: "Django", base: "XML", ... } // been dinged because it's not a base language
{ score: 45.9, language: "Handlebars", base: "XML", ... } // been dinged because it's not a base language
{ score: 45.9, language: "ERB", base: "XML",... } // been dinged because it's not a base language
] The existing top match/second match would just become the first two items from this new list. |
I didn't quite understand how you're going to use it, but it's fine with me. |
- *technically* exact relevance ties in auto-detect are bad - HLJS resolves ties in _alphabetic order_ and the fact that they currently work is more a matter of luck than anything else - this now bubbles up such issues when the checker tool is run Eventually something like highlightjs#2275 should improve matters.
- Entire library now no longer uses any run-time dependencies (no grammar will break because another is missing) - Add docs for fixMarkup deprecation - Add docs for requireLanguage deprecation - Remove `Requires:` meta-data used only for run-time dependencies. - Gzip size is only ~230 bytes larger than before - Fix auto-detect tests now that the equivalency between `cpp` and `arduino` is no longer hidden by the load order. - adds `supersetOf` to resolve conflicts between very similar languages (C++, Arduino). The base language wins in the case of a tie. ie `arduino` is now a `supersetOf` `cpp`. (closes #2275) Note: `Requires: ` still stays overall as it's used to clue the build process about "loose" (non-breaking) sublanguage dependencies between languages. This dependency is purposely loose and therefore still a thing.
From my tweaked language checker:
These aren't surprising, but also aren't very helpful. They would be reported as errors except for the load order that's enforced by the requires... and the fact that we magically prefer the "first match" even if the scores are the same.
I'd like to add an explicit key to explain these relationships to the parser.
So that Arduino would be a
superset: 'cpp'
and Django would be asuperset: 'xml'
. I'm open to better naming. If two languages are detected to be related in this way then the auto-detect would simply ignore the superset in cases of EQUAL matches, only preferred it if it had a HIGHER match (which is what you'd expect if it was the superset, and not the original).Also if you're looking to correct an incorrect match it's not helpful to know the second match to cpp was Arduino, they are essentially the same (not going to color differently, etc). It would be useful to know the second match (OTHER than Arduino) was say Pascal.
That's the change I'm suggested here if no one is opposed. Should only be a few lines of code and adding the appropriate attribute to a few grammars.
The text was updated successfully, but these errors were encountered: