
Automate conformance test suite #1678

Merged: JelleZijlstra merged 24 commits into python:main from conform on Apr 5, 2024

Conversation

JelleZijlstra (Member)

Currently, the conformance test suite relies on manual scoring of results. This is laborious and error-prone, because it means we have to manually compare long lists of expected and observed error output.

This PR sets up an alternative system (a rough code sketch follows the list):

  • Test cases indicate lines where an error is expected to occur with "# E". Optional errors are indicated with "# E?".
  • We parse the output of each type checker to find the lines where it reports an error.
  • We compare the expected and actual errors and write the result to the TOML file.
  • The test passes if the two sets match.
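
Roughly, the expected-error parsing works like this (a simplified sketch; the helper name is illustrative, and the real implementation has more detail):

import re

# Matches a "# E" or "# E?" marker at the end of a line, optionally followed
# by an explanation, e.g.: x: int = ""  # E: incompatible assignment
EXPECTED_ERROR = re.compile(r"#\s*E(\??)(?::.*)?$")

def parse_expected_errors(test_path: str) -> tuple[set[int], set[int]]:
    """Return (required, optional) sets of 1-based line numbers."""
    required: set[int] = set()
    optional: set[int] = set()
    with open(test_path) as f:
        for lineno, line in enumerate(f, start=1):
            if match := EXPECTED_ERROR.search(line.rstrip()):
                (optional if match.group(1) else required).add(lineno)
    return required, optional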

I ran into a few categories of issues:

  • Expected errors are not always marked consistently in the conformance test suite. I replaced many comments of the form "# Type Error" or similar with "# E", and manually added "# E" where necessary, but a few such cases may remain.
  • Type checkers may report multiple errors for one line. I initially tried to accommodate this by allowing multiple "# E" comments per line, but this felt too burdensome.
  • Type checkers may report errors on different lines. In particular, mypy and pyright appear to differ in whether certain decorator-related errors are reported on the decorator line or on the next line. I haven't come up with a solution for this yet; one option is a special comment that means "error on either this line or the next". (A sketch of the comparison step follows this list.)
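
The comparison step itself is simple (again a simplified sketch with illustrative names); lines marked "# E?" may produce an error or not without affecting the result:

def compare_errors(
    required: set[int], optional: set[int], actual: set[int]
) -> list[str]:
    """Return discrepancies; an empty list means the test passes."""
    problems = [
        f"Line {lineno}: expected error was not reported"
        for lineno in sorted(required - actual)
    ]
    problems += [
        f"Line {lineno}: unexpected error reported"
        for lineno in sorted(actual - required - optional)
    ]
    return problems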

I checked for issues by comparing the new code's decision on whether a type checker is conformant with the manual scoring in the existing files. The script unexpected_fails.py prints files where the result differs.
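
In outline, the script does something like the following (a sketch; the conformance_automated field name is illustrative, and the real script may differ):

import pathlib
import tomllib  # Python 3.11+

def print_unexpected(results_dir: str = "conformance/results") -> None:
    for toml_path in sorted(pathlib.Path(results_dir).glob("*/*.toml")):
        data = tomllib.loads(toml_path.read_text())
        manual = data.get("conformant", "")                # e.g. "Pass", "Partial"
        automated = data.get("conformance_automated", "")  # illustrative field name
        # Anything other than "Pass" counts as failing for this comparison.
        if (manual == "Pass") != (automated == "Pass"):
            print(f"{toml_path.relative_to(results_dir)}: {manual} vs. {automated}")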

The following mismatches remain:

mypy/dataclasses_usage.toml: Pass vs. Fail
mypy/directives_version_platform.toml: Pass vs. Fail
mypy/namedtuples_define_functional.toml: Pass vs. Fail
mypy/overloads_basic.toml: Pass vs. Fail
mypy/qualifiers_final_decorator.toml: Pass vs. Fail
mypy/specialtypes_never.toml: Pass vs. Fail
mypy/typeddicts_alt_syntax.toml: Pass vs. Fail
mypy/typeddicts_type_consistency.toml: Pass vs. Fail
pyre/annotations_typeexpr.toml: Pass vs. Fail
pyre/dataclasses_kwonly.toml: Pass vs. Fail
pyre/directives_no_type_check.toml: Pass vs. Fail
pyre/directives_reveal_type.toml: Unsupported vs. Pass
pyre/namedtuples_define_functional.toml: Pass vs. Fail
pyre/typeddicts_operations.toml: Pass vs. Fail
pyright/classes_override.toml: Pass vs. Fail
pyright/dataclasses_frozen.toml: Pass vs. Fail
pyright/dataclasses_kwonly.toml: Pass vs. Fail
pyright/dataclasses_slots.toml: Pass vs. Fail
pyright/dataclasses_transform_field.toml: Pass vs. Fail
pyright/dataclasses_transform_func.toml: Pass vs. Fail
pyright/generics_defaults.toml: Pass vs. Fail
pyright/generics_defaults_referential.toml: Pass vs. Fail
pyright/generics_paramspec_semantics.toml: Pass vs. Fail
pyright/generics_self_basic.toml: Pass vs. Fail
pyright/generics_syntax_declarations.toml: Pass vs. Fail
pyright/generics_syntax_infer_variance.toml: Pass vs. Fail
pyright/generics_syntax_scoping.toml: Pass vs. Fail
pyright/generics_typevartuple_basic.toml: Pass vs. Fail
pyright/generics_variance.toml: Pass vs. Fail
pyright/literals_literalstring.toml: Pass vs. Fail
pyright/namedtuples_define_class.toml: Pass vs. Fail
pyright/namedtuples_usage.toml: Pass vs. Fail
pyright/narrowing_typeis.toml: Pass vs. Fail
pyright/overloads_basic.toml: Pass vs. Fail
pyright/protocols_generic.toml: Pass vs. Fail
pyright/protocols_runtime_checkable.toml: Pass vs. Fail
pyright/protocols_variance.toml: Pass vs. Fail
pyright/qualifiers_final_decorator.toml: Pass vs. Fail
pyright/specialtypes_never.toml: Pass vs. Fail
pyright/tuples_unpacked.toml: Pass vs. Fail
pytype/dataclasses_usage.toml: Pass vs. Fail
pytype/directives_assert_type.toml: Pass vs. Fail
pytype/directives_no_type_check.toml: Pass vs. Fail
pytype/directives_version_platform.toml: Pass vs. Fail
pytype/generics_scoping.toml: Pass vs. Fail
pytype/generics_upper_bound.toml: Pass vs. Fail
pytype/namedtuples_define_functional.toml: Pass vs. Fail
pytype/typeddicts_inheritance.toml: Pass vs. Fail

I found a number of inconsistencies in the test suite this way already, and there may be more lurking in the remaining mismatches.

I'll continue to chip away at the inconsistencies as I have time, but in the meantime I'm opening up this PR for discussion.

If we agree on the framework, we can merge this PR even if some work still remains, so we avoid constant merge conflicts. Then, once we're confident in the automatic scoring, we can replace the current manual scoring completely.

@erictraut (Collaborator)

Thanks for exploring this. I think it's promising, but I'm still a bit skeptical of this approach.

I agree that manual scoring is laborious the first time a test is written, but after that it's almost no work because it's based on output deltas, and those rarely change.

I think any approach is going to require some human interpretation of the results. If we want the results to be readable by type checker users and authors, it's better to present a summary of the issues, no? The output of the proposed approach will be very difficult to interpret if there's no summary.

Maybe a hybrid of the current approach and your proposed approach strikes the right balance? I'm thinking that the automation could help during scoring, but the scoring process still has a human in the loop. Or maybe that's what you're proposing already?

In addition to the list of categories that you identified, I can add the following:

  • Some type checkers perform additional checks that generate errors. These are OK and don't violate the typing spec. Pyre, for example, emits an error if an attribute is declared but is not assigned a value within that class implementation. None of the other type checkers do this. I guess we could use # type: ignore in these cases, but I'm not a fan of it because it increases fragility, and I'd prefer not to use it in a test suite.
  • The presence or absence of some errors depends on behaviors that have not yet been specified in the typing spec. I've tried to minimize this situation in the test code, but there are a few places where it was unavoidable. I added copious human-readable comments in these few cases. Automating this might be difficult; I suppose we could choose some alternate "comment code" for cases like these.

I'm OK merging the PR in its current form as long as it doesn't affect the reported results and doesn't get in the way of continued work on the conformance test. Even if we don't end up adopting this framework, standardizing on "E" (or some other similar comment format) is a change I'd love to see merged.

@JelleZijlstra (Member, Author)

Thanks. I agree that we can't move fully to automated scoring; maybe the right long-term state is that passing tests are fully automated, but for failing tests we require a human to add a "notes" field describing the missing features and another field summarizing the support (e.g., "Unsupported" or "Partial").

For now the PR merely adds fields that are ignored in the summary report; we can leave changes to the summary report for a later PR.

You are right about the two other problematic areas:

  • Type checkers generating additional errors: I encountered this in aliases_explicit.py, where mypy complains on BadTypeAlias12: TA = list or set that list is always truthy. In this case I added a mypy directive to turn off the specific error code, but that won't always be possible. Fortunately, such cases are relatively rare.
  • Unspecified behavior: I encountered this first in annotations_methods.py, where mypy and pyright behave differently and the comments imply that either is valid. I handled it by adding "E?" (an optional error) on both lines. (A fragment illustrating these conventions follows.)
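
To make the conventions concrete, here is a hypothetical test fragment (the specific checks and the disabled error code are illustrative, not lines from the actual suite):

# mypy: disable-error-code="truthy-function"  # illustrative: silences a mypy-only diagnostic

def takes_str(x: str) -> None: ...

takes_str(1)     # E: required; every conformant checker must report an error here
takes_str("ok")  # no marker: no checker may report an error on this line

class HasAttr:
    attr: int  # E?: optional; some checkers flag declared-but-unassigned attributes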

@JelleZijlstra (Member, Author)

For visibility (since the PR is huge), I changed the following conformance judgments:

--- a/conformance/results/mypy/generics_typevartuple_unpack.toml
+++ b/conformance/results/mypy/generics_typevartuple_unpack.toml
@@ -1,7 +1,7 @@
-conformant = "Partial"
-notes = """
-Does not reject multiple unpack operators in a tuple.
-"""
+conformant = "Pass"

This test file does not actually contain a case with multiple unpack operators in a tuple.

--- a/conformance/results/pyre/annotations_typeexpr.toml
+++ b/conformance/results/pyre/annotations_typeexpr.toml
@@ -1,4 +1,7 @@
-conformant = "Pass"
+conformant = "Partial"
+notes = """
+Rejects some generics
+"""

The errors were hard to interpret, but pyre appeared to fail some of the assert_type() calls in this test.

--- a/conformance/results/pyre/directives_no_type_check.toml
+++ b/conformance/results/pyre/directives_no_type_check.toml
@@ -1,4 +1,4 @@
-conformant = "Pass"
+conformant = "Partial"
 notes = """
 Does not honor @no_type_check decorator.
 """

If it doesn't implement @no_type_check, it does not pass this test.

--- a/conformance/results/pyre/namedtuples_define_functional.toml
+++ b/conformance/results/pyre/namedtuples_define_functional.toml
@@ -1,4 +1,4 @@
-conformant = "Pass"
+conformant = "Partial"
 notes = """
 Does not reject duplicate field names in functional form.
 Does not handle illegal named tuple names the same as runtime.

As the notes indicate, there are some problems.

--- a/conformance/results/pyre/typeddicts_operations.toml
+++ b/conformance/results/pyre/typeddicts_operations.toml
@@ -1,4 +1,7 @@
-conformant = "Pass"
+conformant = "Partial"
+notes = """
+Does not reject `del` of required key.
+"""

This appears to have been missed in manual scoring.

--- a/conformance/results/pyright/directives_no_type_check.toml
+++ b/conformance/results/pyright/directives_no_type_check.toml
@@ -1,4 +1,4 @@
-conformant = "Pass"
+conformant = "Partial"
 notes = """
 Does not honor `@no_type_check` class decorator.
 """

As with pyre, if @no_type_check on classes is not supported, the test should not pass.

--- a/conformance/results/pytype/generics_upper_bound.toml
+++ b/conformance/results/pytype/generics_upper_bound.toml
@@ -1,4 +1,4 @@
-conformant = "Pass"
+conformant = "Partial"
 notes = """
 Does not properly support assert_type.
 """

The error is hard to interpret, but pytype appears to infer the wrong type for some cases in this test.

--- a/conformance/results/pytype/namedtuples_define_functional.toml
+++ b/conformance/results/pytype/namedtuples_define_functional.toml
@@ -1,4 +1,4 @@
-conformant = "Pass"
+conformant = "Partial"
 notes = """
 Does not handle illegal named tuple names the same as runtime.
 Does not support defaults in functional form.

Partial conformance since some features are not supported.

@erictraut if you agree with these changes I can merge this PR and make further changes in smaller, more focused PRs.

@JelleZijlstra marked this pull request as ready for review on April 5, 2024, 13:23
@erictraut (Collaborator)

> As the notes indicate, there are some problems.

The notes here are for behaviors that are optional in the typing spec, so it should say "Pass", not "Partial".

> As with pyre, if @no_type_check on classes is not supported, the test should not pass.

Same as above. The typing spec indicates that support for @no_type_check on classes is optional, and the behavior is undefined. That means a type checker that doesn't implement support for @no_type_check on a class is still conformant with the spec (i.e., it passes), but I think it's still worth recording in the "notes" that it doesn't support @no_type_check on classes.

> Partial conformance since some features are not supported.

Same here. This should be considered a "Pass", not "Partial".

@JelleZijlstra (Member, Author)

Thanks, I'll mark those errors as optional.

@JelleZijlstra (Member, Author)

For the pyre namedtuple test, I'm not convinced pyre should be marked as passing, since the extra errors pyre provides are quite confusing and probably shouldn't be allowed by the spec. However, that can be discussed separately; I'd like to merge this PR first and then open smaller, more focused PRs to discuss the more controversial areas.

I revised the @no_type_check test to reflect the spec's current wording.

@JelleZijlstra merged commit bd85af0 into python:main on Apr 5, 2024; 4 checks passed.
@JelleZijlstra deleted the conform branch on April 5, 2024, 16:30.