Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate parser to the new span combining scheme #126763

Open
petrochenkov opened this issue Jun 20, 2024 · 9 comments
Open

Migrate parser to the new span combining scheme #126763

petrochenkov opened this issue Jun 20, 2024 · 9 comments
Assignees
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-macros Area: All kinds of macros (custom derive, macro_rules!, proc macros, ..) A-parser Area: The parsing of Rust source code to an AST C-cleanup Category: PRs that clean code up or issues documenting cleanup. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@petrochenkov
Copy link
Contributor

petrochenkov commented Jun 20, 2024

Summary:

I a, b, ..., z are multiple consecutive span nodes that need to be combined into a single new span node, then its combined span should be

span(a).to(span(b)).to(span(c))....to(span(z))`

"Span node" is either an individual token (every token has a span), or some larger AST node having a Span field.

Examples:

-10 // an expression
a + 11 // one more expression
a::<u8>::b // a path

The combined span for these pieces of code will be

// expressions are span nodes because they have their own spans in AST, `-` doesn't have a larger AST node so it's treated as a primitive token
span(expr(-10)) = span(token(-)).to(span(expr(10)));
span(expr(a + 11)) = span(expr(a)).to(span(token(+))).to(span(expr(11)));
// the whole generic arg is a span node because it has its own span in AST
span(path(a::<u8>::b)) = span(ident(a)).to(span(token(::))).to(span(generic_arg(<u8>))).to(span(token(::))).to(span(ident(b)));

Status quo:

Currently the resulting span is typically built like span_first_token.to(span_last_token).
So anything in the middle and the internal structure (i.e. nodes, as opposed to tokens) are ignored.

Why we need to change it:

The to operation will automatically take macro variables into account, and will try to put the resulting span into the best suitable macro context (this was implemented in #119673).
E.g. in $tt + 5 the combined expression span will be put into the context of the macro using $tt as a macro parameter.
Note, that the same thing often happens in the current parser as well, but not consistently, e.g. $a::$b will produce an incorrect resulting path span because the :: in the middle is not considered.
This will also give us some single relatively well predictable rule for combining AST spans.

Implementation:

This work should be parallelizable relatively well (but may require a one time initial setup).
I'll review PRs doing this, they can be assigned to me.
It may be convenient to have a rolling value in the Parser structure for span of the current (or previous?) span node.
The parser has a lot of bespoke diagnostic logic (including snapshotting) that stands in the way of any systematic improvements like this.

How this can be tested:

Make a macro that emits complex nodes using tokens from different contexts, e.g.

macro m($l:tt $op:tt $r:tt) {
    2 + 3;
    $l + 3;
    2 $op 3;
    2 + $r;
    $l $op 3;
    $l + $r;
    2 $op $r;
    $l $op $r;
}

m!(2 + 3);

and emit some diagnostic using those nodes' spans (maybe can add a special internal diagnostic for this testing).

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jun 20, 2024
@petrochenkov petrochenkov added A-diagnostics Area: Messages for errors, warnings, and lints A-parser Area: The parsing of Rust source code to an AST E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. A-macros Area: All kinds of macros (custom derive, macro_rules!, proc macros, ..) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 20, 2024
@petrochenkov
Copy link
Contributor Author

It may also be possible to introduce a smarter n-ary span combining operation to(a1, ..., aN) instead of the current binary to, but it may be more expensive and will make spans better only in very rare cases.
So combining using a chain of binary to operations should be fine for now.

@jieyouxu jieyouxu added C-cleanup Category: PRs that clean code up or issues documenting cleanup. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jun 20, 2024
@petrochenkov
Copy link
Contributor Author

either an individual token

The "token" here is a "token tree" really, despite Rust parser working mostly with flattened token sequences at the moment.
So span(#[word]) should ideally be span(#) to span([word]) and not span(#) to span([) to span(word) to(span(]).

Also span combining for delimited groups can ignore all their internal tokens and only consider their delimiters.
Opening and closing delimiters always have the same context, and cannot be passed to macros separately in any way.
So span([word]) should be span([) to span(]) and not span([) to span(word) to span(]).

@DanielFrantes
Copy link

@rustbot assing

@DanielFrantes
Copy link

@rustbot claim

@DanielFrantes
Copy link

@petrochenkov do you mentor this?

@petrochenkov
Copy link
Contributor Author

@danik292 Yes

@DanielFrantes
Copy link

@petrochenkov ok where Is spans implemented?

@petrochenkov
Copy link
Contributor Author

I suggest implementing this for some specific AST nodes that are not too complex, e.g. for paths maybe (fn parse_path in rustc_parse), or for something in types (compiler\rustc_parse\src\parser\ty.rs).
The current (and/or previous) node span may need to be kept in struct Parser.

I'd also suggest starting with some testing infra, like a new attribute for showing spans in an AST node.

#[rustc_show_spans(expr)]
fn foo() {
  let x = a + b;
}

should show a diagnostic (probably a note) similar to this

#[rustc_show_spans(expr)]
fn foo() {
  let x = a + b;
          ^
              ^
          ^^^^^
}

See e.g. rustc_effective_visibility (in ./compiler) for an example of adding a new internal attribute.
The logic for showing spans should likely live in the rustc_ast_passes crate.

@noahmbright
Copy link
Contributor

Is the main issue that something like $a::$b would have an incorrect span? I haven't fully understood the problem yet, but if the span is like a start, a length, and a context, how does the problem come from ignoring the middle tokens?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-macros Area: All kinds of macros (custom derive, macro_rules!, proc macros, ..) A-parser Area: The parsing of Rust source code to an AST C-cleanup Category: PRs that clean code up or issues documenting cleanup. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants