Automate conformance test suite #1678
Conversation
- Avoid quadratic string concatenation
- Avoid bare except
- Avoid unnecessary subshells in subprocess calls
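A minimal sketch of the kinds of fixes these review points describe; the function and variable names here are illustrative, not taken from the PR:

```python
import subprocess


def collect_output(lines: list[str]) -> str:
    # Build the result with join() rather than repeated "+=",
    # which re-copies the growing string on every iteration (quadratic).
    return "\n".join(lines)


def parse_line_number(token: str) -> int | None:
    # Catch the specific exception instead of a bare "except:",
    # which would also swallow KeyboardInterrupt and SystemExit.
    try:
        return int(token)
    except ValueError:
        return None


def run_type_checker(cmd: list[str]) -> str:
    # Pass the command as an argument list instead of a single string
    # with shell=True, so no intermediate shell process is spawned.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout
```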
Thanks for exploring this. I think it's promising, but I'm still a bit skeptical of this approach. I agree that manual scoring is laborious the first time a test is written, but after that it's almost no work because it's based on output deltas, and those rarely change. I think any approach is going to require some human interpretation of the results. If we want the results to be readable by type checker users and authors, it's better to present a summary of the issues, no? The output of the proposed approach will be very difficult to interpret if there's no summary. Maybe a hybrid of the current approach and your proposed approach strikes the right balance? I'm thinking that the automation could help during scoring, but the scoring process still has a human in the loop. Or maybe that's what you're proposing already? In addition to the list of categories that you identified, I can add the following:
I'm OK merging the PR in its current form as long as it doesn't affect the reported results and doesn't get in the way of continued work on the conformance test. Even if we don't end up adopting this framework, standardizing on "E" (or some other similar comment format) is a change I'd love to see merged.
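As a rough illustration of the comment format being discussed (the marker spelling and the sample code are hypothetical, not quoted from the test suite), lines on which a conformant type checker is expected to report an error would be tagged so a script can match reported errors against the tagged lines:

```python
# Hypothetical conformance test snippet using "# E" to mark expected errors.

def greet(name: str) -> int:
    return name  # E: returning str where int is expected


values: list[int] = ["a", "b"]  # E: list elements have the wrong type
```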
Thanks. I agree that we can't move fully to automated scoring; maybe the right long-term state is that passing tests are fully automated, but for failing tests we require a human to add a "notes" field describing the missing features and another field summarizing the support (e.g., "Unsupported" or "Partial"). For now the PR merely adds fields that are ignored in the summary report; we can leave changes to the summary report for a later PR. You are right about the two other problematic areas:
No longer necessary since I no longer count errors.
For visibility (since the PR is huge), I changed the following conformance judgments:
This test case does not cover a case where there are multiple unpacks in a tuple.
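For context, a hypothetical example of such a case (not drawn from the actual test file); the typing spec permits at most one unpacked form within a tuple type, so a conformant checker should reject the second annotation below:

```python
# Requires Python 3.11+ for star syntax in subscripts; the __future__ import
# keeps the annotations unevaluated at runtime.
from __future__ import annotations

from typing import TypeVarTuple

Ts = TypeVarTuple("Ts")


def f(
    ok: tuple[int, *tuple[str, ...]],        # one unpack: allowed
    bad: tuple[int, *Ts, *tuple[str, ...]],  # two unpacks: should be rejected
) -> None: ...
```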
The errors were hard to interpret, but pyre appeared to fail some of the
If it doesn't implement
As the notes indicate, there are some problems.
This appears to have been missed in manual scoring.
As with pyre, if
The error is hard to interpret, but pytype appears to infer the wrong type for some cases in this test.
Partial conformance, since some features are not supported. @erictraut, if you agree with these changes, I can merge this PR and make further changes in smaller, more focused PRs.
The notes here are for behaviors that are optional in the typing spec, so it should say "Pass", not "Partial".
Same as above. The typing spec indicates that support for
Same here. This should be considered a "Pass", not "Partial".
Thanks, I'll mark those errors as optional.
For the pyre namedtuple test, I'm not convinced pyre should be marked as passing, since the extra errors pyre provides are quite confusing and probably shouldn't be allowed by the spec. However, that can be discussed separately; I'd like to merge this PR first and then open smaller, more focused PRs to discuss the more controversial areas. I revised the
Currently, the conformance test suite relies on manual scoring of results. This is laborious and error-prone, because it means we have to manually compare long lists of expected and observed error output.
This PR sets up an alternative system:
I ran into a few categories of issues:
I checked for issues by comparing the new code's decision on whether a type checker is conformant with the manual scoring in the existing files. The script unexpected_fails.py prints files where the result differs. The following issues remain:
I found a number of inconsistencies in the test suite this way already, and there may be more lurking in the remaining mismatches.
I'll continue to chip away at the inconsistencies as I have time, but in the meantime I'm opening up this PR for discussion.
If we agree on the framework, we can merge this PR even if some work still remains, so we can avoid constant merge conflicts. Then we can eventually replace the current manual scoring completely once we're confident in the automatic scoring.
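As a rough sketch of the kind of comparison unexpected_fails.py performs (this is not the PR's actual code; the results-file layout and field names are assumptions):

```python
"""Hypothetical sketch: flag files where the automated decision disagrees
with the manual "conformant" judgment recorded in each results file."""
from pathlib import Path

import tomllib  # Python 3.11+


def automated_decision(results: dict) -> str:
    # Assumption: the automated check passes only if no expected errors
    # are missing and no unexpected errors were reported.
    errors = results.get("errors", {})
    passed = not errors.get("missing") and not errors.get("unexpected")
    return "Pass" if passed else "Fail"


def main(results_dir: str = "conformance/results") -> None:
    for path in sorted(Path(results_dir).rglob("*.toml")):
        with path.open("rb") as f:
            results = tomllib.load(f)
        manual = results.get("conformant", "Unknown")
        automated = automated_decision(results)
        # Print only the files where the two scoring methods disagree.
        if (automated == "Pass") != (manual == "Pass"):
            print(f"{path}: manual={manual!r}, automated={automated!r}")


if __name__ == "__main__":
    main()
```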