Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor error recovery after missing > #24642

Closed
sharwell opened this issue Feb 5, 2018 · 23 comments · Fixed by #69734
Closed

Poor error recovery after missing > #24642

sharwell opened this issue Feb 5, 2018 · 23 comments · Fixed by #69734
Labels
Area-Compilers Bug Developer Community The issue was originally reported on https://developercommunity.visualstudio.com help wanted The issue is "up for grabs" - add a comment if you are interested in working on it
Milestone

Comments

@sharwell
Copy link
Member

sharwell commented Feb 5, 2018

Reported at Link
Version Used: 15.6 Preview 3

Steps to Reproduce:

using System;
using System.Collections.Generic;
using System.Linq;

namespace Microsoft.Example
{
    class Class1
    {
        public static IEnumerable<(String Value, String Description) GetAllValues(Type t)
        {
            if (!t.IsEnum)
                throw new ArgumentException("no good");

            return Enum.GetValues(t).Cast<Enum>().Select(e => (e.ToString(), e.ToString()));
        }
    }
}

Expected Behavior:

A message saying the > is missing between Description) and GetAllValues.

Actual Behavior:

48 errors, and an editor that looks like this:

image

@sharwell sharwell added Bug help wanted The issue is "up for grabs" - add a comment if you are interested in working on it Area-Compilers Developer Community The issue was originally reported on https://developercommunity.visualstudio.com labels Feb 5, 2018
@Neme12
Copy link
Contributor

Neme12 commented Feb 5, 2018

Maybe it would help a little bit (it would definitely help in other cases as well) if the parser recognized a block directly inside a type declaration and parsed it as a BlockSyntax. Right now it skips the opening curly brace and eats the closing one.

@Neme12
Copy link
Contributor

Neme12 commented Feb 5, 2018

I made a prototype of this:
image

to be clear: this does not address the underlying issue of the missing >. The method signature isn't parsed correctly. This just provides a way to gracefully handle the situation when we do fail to parse the signature and if the parser finds a block directly inside a class. But there is a concern here that the error is too small and easy to miss given that the whole block is invalid and we couldn't parse a method at all.

I'm not sure how the > could be addressed. If we insert a missing > after parsing a complete type, it would harm recovery in the case where you forgot to type a , instead - it would presume you meant a >. This seems like a hard problem to solve.

@sharwell
Copy link
Member Author

sharwell commented Feb 5, 2018

I'm not sure how the > could be addressed. If we insert a missing > after parsing a complete type, it would harm recovery in the case where you forgot to type a , instead - it would presume you meant a >. This seems like a hard problem to solve.

An astute observation! While working on ANTLR 4, we found that a preference to close a nested construct (e.g. insert a missing >) typically provided better large-scale error handling than a preference to stay inside a looping construct (e.g. insert a missing ,).

@Neme12
Copy link
Contributor

Neme12 commented Feb 5, 2018

@sharwell True, it does seem a lot less likely that you'd be missing a comma, except when you're writing in a new parameter. Maybe the case of a missing closing bracket/paren was considered less important because the IDE automatically inserts them as you're typing?

As a separate issue, it might make sense to investigate if the parser could make less distinction between scripting/regular syntax and push those errors to the semantic model (which already reports ERR_GlobalStatement), if it doesn't come at a performance cost. That also sounds like good general strategy for error recovery because if there's illegal syntax found in a certain context, it's likely it might be a legal construct in some place in the language. It's better to parse it and then tell them they're doing something wrong with 1 clear error message rather than having 50 errors for every single token.

@CyrusNajmabadi
Copy link
Member

As a separate issue, it might make sense to investigate if the parser could make less distinction between scripting/regular syntax and push those errors to the semantic model (which already reports ERR_GlobalStatement), if it doesn't come at a performance cost.

That's the approach we've been taking for a while. We even codified it midway last year when i was doing PRs like: 001294b#diff-ac2b4121f27df06f497115cc68277654

In general, the strategy to take is to try to get the parser to only try to fit the code to the Syntax-Model. If you can't fit the code into the SyntaxModel, you definitely error. Once you can fit though, extra checks are performed during semantic analysis instead.

@CyrusNajmabadi
Copy link
Member

I'm not sure how the > could be addressed. If we insert a missing > after parsing a complete type, it would harm recovery in the case where you forgot to type a , instead - it would presume you meant a >. This seems like a hard problem to solve.

It's not too tough in practice. Because our parser is hand-written recursive descent, we have the easy ability to write whatever code we want during error recovery. So we can examine the surrounding state to try to figure out which sort of issue it it and make the right choice based on what we see around it.

By default there is a baked in algorithm that is both correct, and pretty reasonable. But when it produces suboptimal results, it's fine to provide a specialized error resolution strategy.

@Neme12
Copy link
Contributor

Neme12 commented Feb 12, 2018

In general, the strategy to take is to try to get the parser to only try to fit the code to the Syntax-Model. If you can't fit the code into the SyntaxModel, you definitely error. Once you can fit though, extra checks are performed during semantic analysis instead.

My suggestion was that the parser would parse out GlobalStatementSyntax, just like it does in the scripting dialect.

Because our parser is hand-written recursive descent, we have the easy ability to write whatever code we want during error recovery

Of course, I just wasn't sure what it should do in this case or how it can figure out which sort of issue this it. GetAllValues could be a type after all. Do you have in mind some sort of lookahead from the point after the tuple ) where it expected either > or, to look if there's a matching > for the last < and if there is, insert a comma just like today but if there isn't, insert a > instead?

@AdamSpeight2008
Copy link
Contributor

I think there is an option with the syntax.xml to insert if missing. If so that would good option.

@CyrusNajmabadi
Copy link
Member

Of course, I just wasn't sure what it should do in this case or how it can figure out which sort of issue this it. GetAllValues could be a type after all. Do you have in mind some sort of lookahead from the point after the tuple ) where it expected either > or, to look if there's a matching > for the last < and if there is, insert a comma just like today but if there isn't, insert a > instead?

Honestly, you'd just have to write up a bunch of cases, and then see what you could do with the code. Ideally, whatever strategy you ended up implementing would be fairly lightweight and would not be too complex.

In other words: error recovery is an unending set of work in the parser. We tend to drive it based on real use cases where things go bad, combined with reasonable tweaks (i.e. looking at the local area, or doing small bits of speculative lookahead) that improve these cases without regressing anything else.

@sharwell sharwell added 4 - In Review A fix for the issue is submitted for review. and removed help wanted The issue is "up for grabs" - add a comment if you are interested in working on it labels Feb 17, 2018
@jinujoseph jinujoseph added this to the 15.8 milestone Feb 19, 2018
@jcouv jcouv added help wanted The issue is "up for grabs" - add a comment if you are interested in working on it and removed 4 - In Review A fix for the issue is submitted for review. labels Jul 4, 2018
@jcouv jcouv modified the milestones: 15.8, 16.0 Jul 4, 2018
@YairHalberstadt
Copy link
Contributor

YairHalberstadt commented Jan 3, 2019

Taking a quick look at a bunch of cases:

In a case where we expect either a > or a , token, then assuming we are missing a , when we are actually missing a > leads to terrible error recovery. Assuming we are missing a > when we are actually missing a , isn't brilliant, but tends to only cause localized errors.

As such the obvious first step is to assume we are missing a > not a , in such situations.

The same applies for > and . (i.e. List<A B might be List<A.B> or List<A> B)

If we lookahead and find a > without a matching <, maybe we could decide that actually a , was in order. I don't think there's any scenario where we would assume we were missing a . rather than a , as they lead to equivalent error recovery, are indistinguishable syntactically if the token is missing, and the latter is probably more common than the former.

@CyrusNajmabadi
Copy link
Member

Taking a quick look at a bunch of cases:

In a case where we expect either a > or a , token, then assuming we are missing a , when we are actually missing a > leads to terrible error recovery. Assuming we are missing a > when we are actually missing a , isn't brilliant, but tends to only cause localized errors.

Could you expand on this? Why is this the case? Say the user really is missing a comma in a list, and you put in the >, won't that make the rest of the type-arg list just bogus stuff that will throw off the parser in other ways?

If we lookahead and find a > without a matching <,

Can you clarify what you mean by 'lookAhead' here?

@YairHalberstadt
Copy link
Contributor

@CyrusNajmabadi

Could you expand on this? Why is this the case? Say the user really is missing a comma in a list, and you put in the >, won't that make the rest of the type-arg list just bogus stuff that will throw off the parser in other ways?

compare https://sharplab.io/#v2:D4AQDABCCMDcCwAoEBmKAmCBhCBvJEhUaAIgJYDGALmQPYB2AhgE4CeAPGfVQHxTRg+AWQAUASjwEi0qYQC+SOUA
with
https://sharplab.io/#v2:D4AQDABCCMDcCwAoEBmKAmCBhCBvJEhUaAMgJYDOALgDxkB2VEAsgBQCUeBRP3hAvkn5A===

They're both pretty terrible, but the Method declaration is at lease picked up in the first. Similarly for a missing ,`>` in argument lists. They're both pretty much as bad as each other in variable declarations.

So in a toss-up I would err on the side of adding a > token

@YairHalberstadt
Copy link
Contributor

YairHalberstadt commented Jan 3, 2019

Can you clarify what you mean by 'lookAhead' here?

I think a sensible algorithm might be to say that a GenericTypeArgumentList will always appear somewhere before one of:

;
=
{
=>
: (in generic constraints, although we could rely on '{' )
where (in generic constraints, although we could rely on '{' )
new (a common keyword in expressions/statements although we could rely on ';' )

We look ahead till the next such token, and if we are able to find a closing > token (counting opening < along the way) we assume we are missing a comma.

Else we assume we are missing a > token.

If we get past some constant number of tokens, we give up and assume it's a > given that such a long type argument list is extremely unlikely

Does that sound sensible?

At the cost of making the function longer (but since it's a switch, presumably not much less performant) we could add all the miriad SytntaxKinds that can't appear in a Generic Type Argument List, but I think the above syntax kinds will be sufficient.

@YairHalberstadt
Copy link
Contributor

Note that Type Parameter Lists are parsed entirely seperately in the parser, and so I would deal with them seperately to TypeArgumentLists

@CyrusNajmabadi
Copy link
Member

Note that Type Parameter Lists are parsed entirely seperately in the parser, and so I would deal with them seperately to TypeArgumentLists

FWIW, this feels like something that would be valuable for many lists. I've oft been surprised that the C# parser has a dedicated loop for each type of list construct. For example, when i wrote the TypeScript parser, we just had a single function for parsing lists (and one for parsing delimited lists), that way any improvements we made would be adopted across all lists:

https://github.com/Microsoft/TypeScript/blob/b7d7d5f7b39a5b9619c77590e5fe7f434ed68f1e/src/compiler/parser.ts#L1687-L1708

https://github.com/Microsoft/TypeScript/blob/b7d7d5f7b39a5b9619c77590e5fe7f434ed68f1e/src/compiler/parser.ts#L2046-L2112

It seems a shame that if we did this work here it would only apply to this type of list instead of getting the benefit for all lists.

@YairHalberstadt
Copy link
Contributor

@CyrusNajmabadi
Type parameter lists can only be one token per parameter, so fixing this for them is extremely simple - you just look ahead one token and see what you find.
I'd rather do that in a seperate PR to make things easier to review.

@YairHalberstadt
Copy link
Contributor

Some fun statistics about the length of TypeArgumentLists in CoreFX by token length:

Token Length Count Cumulative Percentage
3.00 76276 0.774604
4.00 525 0.779935
5.00 10098 0.882483
6.00 3829 0.921368
7.00 1355 0.935128
8.00 1953 0.954961
9.00 1267 0.967828
10.00 775 0.975698
11.00 578 0.981568
12.00 248 0.984087
13.00 417 0.988321
14.00 283 0.991195
15.00 110 0.992312
16.00 220 0.994547
17.00 46 0.995014
18.00 55 0.995572
19.00 32 0.995897
20.00 30 0.996202
21.00 53 0.99674
22.00 26 0.997004
23.00 29 0.997299
24.00 14 0.997441
25.00 2 0.997461
26.00 116 0.998639
28.00 23 0.998873
30.00 10 0.998974
31.00 3 0.999005
32.00 11 0.999116
34.00 23 0.99935
35.00 2 0.99937
36.00 3 0.999401
38.00 2 0.999421
39.00 14 0.999563
41.00 7 0.999634
43.00 5 0.999685
45.00 5 0.999736
46.00 1 0.999746
47.00 5 0.999797
48.00 1 0.999807
49.00 5 0.999858
51.00 7 0.999929
53.00 6 0.99999
56.00 1 1

So that gives us an idea for how far we would need to lookahead in the worst case scenario

@CyrusNajmabadi
Copy link
Member

Type parameter lists can only be one token per parameter, so fixing this for them is extremely simple - you just look ahead one token and see what you find.
I'd rather do that in a seperate PR to make things easier to review.

My point was about doing this fix for all separated lists, not just this list. :)

@YairHalberstadt
Copy link
Contributor

@CyrusNajmabadi
I don't see how the fix could be generalised for all seperated lists. The best I could see would be creating some type of interface along the lines of

interface IMissingTokenDisambuguatorStateMachine
{
     void Reset();
     DisambiguationResult FeedNextToken(SyntaxKind kind);
}

enum DisambiguationResult
{
    MissingSeperator,
    MissingCloser,
    NeedMoreInfo
}

And then having a method:

bool IsMissingSeperator(IMissingTokenDisambuguatorStateMachine disambiguator)
{
    var lookAheadCount = 0;
    disambiguator.Reset();
    while(true)
    {
         var currentKind = this.PeekToken(lookAheadCount).Kind;
         if(currentKind == SyntaxKind.EndOfFile)
             return true; 
         switch(disambiguator.FeedNextToken(currentKind))
         {
             case DisambiguationResult.MissingSeperator:
                 return true;
             case DisambiguationResult.MissingCloser:
                 return false;
             case DisambiguationResult.NeedMoreInfo:
                 lookAheadCount++;
             default:
                 throw new InvalidOperationException();
         }
    }
}

I don't know if this generalisation is useful enough to be worth it though.

@CyrusNajmabadi
Copy link
Member

I don't see how the fix could be generalised for all seperated lists.

I linked how we did it in TypeScript. It's very simple.

@YairHalberstadt
Copy link
Contributor

YairHalberstadt commented Jan 5, 2019

Also note there's no general way to improve error recovery inside a method if the '>' token is missing, as we usually assume we are dealing with an expression not a type when the '>' is missing.

for example

List<Int a = new List<Int>();

// will be parsed as 
List < Int; a = new List<Int>();

So this fix will be about cases where we can unambiguously parse it as a type.

@Rekkonnect
Copy link
Contributor

@CyrusNajmabadi did #69482 also indirectly fix this, or is it still a problem?

@CyrusNajmabadi
Copy link
Member

That fix would not help this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compilers Bug Developer Community The issue was originally reported on https://developercommunity.visualstudio.com help wanted The issue is "up for grabs" - add a comment if you are interested in working on it
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants