-
Notifications
You must be signed in to change notification settings - Fork 19
[WIP] Implement (possibly?) the entire Rust grammar. #13
Conversation
I may have come up with a way to separate stable and unstable features: adjoining grammars - that is, making |
cc @petrochenkov @jseyfried @Manishearth @nikomatsakis @nrc @pnkfelix I should probably also spend some time reading previous attempts at a Rust grammar, to see if they hold some information I might be missing here. Note, however, that this particular PR is not an at attempt at disambiguating and fully restricting the grammar, to one specific parse tree (even if it does that 8% of the time already). (there's also the option of removing some of the restrictions, in unambiguous situations; I mean, it's almost 2019 and we still haven't standardized on fully general CFG parsing everywhere?!) |
grammar/item.g
Outdated
Variadic:"..." | | ||
RegularAndVariadic:{ args:FnArg+ % "," "," "..." }; | ||
FnArg = | ||
SelfValue:{ mutable:"mut"? "self" } | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This misses type ascription I think (which is stable in limited form)
struct S;
impl S {
fn f(self: S) {}
}
fn main() {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that that parses, but self
there is just a binding pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Centril do we yet have a proposal for "self" "@" Pat { ":" Type }?
, or were there previous attempts that died out? IMO it would be a natural extension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eddyb that would just fall out from having self
be a valid pattern + intersection patterns; i.e.
pat ::= "self" | pat "@" pat | pat ":" type | ... ;
I've mentioned the possibility of making self
a valid pattern in https://github.com/Centril/rfcs/blob/rfc/generalized-type-ascription/text/0000-generalized-type-ascription.md#self-as-a-pattern but it isn't part of the proposal right now.
Am I correct that this grammar does not reflect peculiarities around "restrictions" ? That, that some expressions must be terminated by |
@matklad I think that's a long-term question, that this WG should figure out, eventually. @eternaleye suggested using boolean grammars to indicate such things, but even though it would be possible to extend the GLL implementation to handle conjunction and negation, natively, I would prefer if we waited for something based on rewriting the grammar to a CFG (similar to what you'd do with propagating the restriction down through grammar templates, but automated). We might introduce templates into the notation, but I'd rather not make a mess of things with it. If we end up with two grammars (because this initial one is pretty small and could be kept around anyway), I would want to have some sort of naive "inclusion check", that would ensure that the refined grammar parses strictly less than the simple, more general, one. |
I usually assign myself to PRs/issues I need to do something with, to not forget about them. |
@petrochenkov Oh that's annoying, this is a different org than |
grammar/generics.g
Outdated
WhereClause = "where" bounds:WhereBound* % "," ","?; | ||
WhereBound = | ||
Lifetime:{ lt:LIFETIME ":" bounds:LifetimeBound* % "+" "+"? } | | ||
Type:{ binder:ForAllBinder? ty:Type ":" bounds:TypeBound* % "+" "+"? } | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disturbing fact: the following is valid Rust syntax:
trait X {}
fn foo<T>()
where for<'a> for<'b> X: Clone
{}
however, syntactically these for
's are different: the first one is a part of where predicate, the second one is a part of the type. Our ambiguity testing should ideally expose this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yupp, we basically have a precedence rule that eagerly takes the first encountered for<'a>
and uses it for the whole T: A + B +...
.
Even more fun is the optional but not nestable parens around a bound.
where i32: (for<> Copy)
looks weird but is perfectly valid today!
@Centril suggested we should make it a proper expression grammar with parens and +
as a binop between bounds, which I feel pretty sympathetic towards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even more fun is the optional but not nestable parens around a bound.
Sorry, I tried to prevent that, but lang team forced the decision.
we should make it a proper expression grammar with parens and + as a binop between bounds
Why?
Bound + Bound + Bound + ...
is a list with separator, not multiple binary operators, it could use Bound, Bound, Bound, ...
or something equally well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
Well, it would explain the parens. And they do feel maybe like entities that can be arbitrarily composed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rules with the optional trailing comma don't seem to handle the case for zero elements. For example (,)
seems to parse.
I added notes about things that are unstable. libsyntax often parses them without complaint, but I think they should probably be noted somehow.
Apologies if I'm wrong about anything. Also, sometimes the line between syntactical and semantic rules seem to be a little blurry to me. My rule of thumb was whether or not it's accepted as a macro fragment, which may not be the best way to look at it, but allows you to easily say what the fragment matchers accept.
grammar/attr.g
Outdated
"(" TOKEN_TREE* ")" | | ||
"[" TOKEN_TREE* "]" | | ||
"{" TOKEN_TREE* "}"; | ||
"=" TOKEN_TREE | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using token_tree (and macroinput) here seems very broad. The grammar at https://github.com/rust-lang-nursery/reference/blob/master/src/attributes.md I think is correct and more specific. I'm not sure what would be a good way to express that literals cannot have suffixes, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehuss This is intended afaik. rust-lang/rust#55208.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, indeed, I had forgotten about that. Regardless, I think it would be good if the grammar eventually covered the stable syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK it has been stabilized for attribute proc macros, just not built-in ones?
@petrochenkov can you confirm?
As for specific built-in attributes, sure, we could provide grammars for each of them, but it's similar to having a grammar for e.g. format_args!
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Syntax with delimiters is stable and is rejected/gated in non-macro attributes with a semantic check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PATH = TOKEN_TREE
is mostly unstable, I think.
Key-value attributes are not allowed in macros, and gated for non-literals in non-macros, but gating happens after expansion, so something like #[doc = $doc]
where $doc: expr
is also on stable.
grammar/expr.g
Outdated
Range:{ start:Expr? ".." end:Expr? } | | ||
RangeInclusive:{ start:Expr? "..=" end:Expr } | | ||
Type:{ expr:Expr ":" ty:Type } | | ||
Cast:{ expr:Expr "as" ty:Type } | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type without bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type:{ expr:Expr ":" ty:Type } |
is also unstable (so leave a note).
grammar/macro.g
Outdated
@@ -0,0 +1,13 @@ | |||
MacroCall = path:Path "!" ident_input:IDENT? input:MacroInput; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was studying the parser code, this IDENT here had me stumped. Is this some old vestigial thing? I can't find a way to get it to actually parse, though.
Also, the type of path depends on where it is used. For the reference grammar, I just said it was a simple ("mod"-style) path, because in practice that's the only way it can be used. But from a parsing standpoint, I guess it could be more accurate. For example, foo<T>!{}
does not parse as an expression, but it does parse as a type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macro_rules! name {
// ....
}
It depends on whether macro_rules
is considered a MacroCall
, I suppose. As far as I'm aware, only macro_rules!
has this syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting.
What I was thinking of is that foo!bar{} is considered a valid statement in libsyntax (https://github.com/rust-lang/rust/blob/65e485d8f1d28102b426c9d6d82f835cd6470d3e/src/libsyntax/parse/parser.rs#L4711-L4714). I assumed that was what it was referring to. macro_rules
is parsed through a separate mechanism (eat_macro_def
). This "special identifier" was added 6 years ago, and I couldn't tell from the commit what it means. All other instances of parsing macro invocations do not check for that identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsing macro_rules
is not special, only when expanding it, does it allow the extra identifier.
A "macro definition" is the new macro
keyword
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm understanding it a little better. However, macro calls are parsed differently in various situations, and some don't allow the extra ident (type, pattern, trait/impl item, foreign item). Eventually these differences will need to be captured in the grammar, right?
Also, I think the syntax for macro_rules
should be captured somewhere, since it is useful for certain rust tooling, syntax highlighting, etc. It is now in the reference, but I'm hoping the wg-grammar and the reference can be closely tied if not identical. Maybe the reference will just have extra stuff beyond the wg-grammar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think the syntax for
macro_rules
should be captured somewhere
We could do this, especially since macro
definitions are not just macro invocations, and we could even have a converter from macro_rules
input patterns to CFGs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually these differences will need to be captured in the grammar, right?
Ideally we would find a way to keep adding rules without making the grammar unreadable and without losing the ability to encode into (E)BNF (automatically).
grammar/generics.g
Outdated
@@ -0,0 +1,35 @@ | |||
Generics = "<" params:GenericParam* % "," ","? ">"; | |||
GenericParam = attrs:OuterAttr* kind:GenericParamKind; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lifetimes can't be listed after types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's more of a semantic restriction, to be really honest.
The introduction of const
generics has made the old rules seem quite silly other thsn from a stylistic point of view. Most of the compiler has been fixed to not assume lifetimes come first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it actually were a semantic restriction then this would not be an error:
macro_rules! foo {
($x: item) => {}
}
foo! {
fn stuff<T, 'a>() {}
}
(Semantic vs. syntactic restrictions have semver consequences due to macros...)
And yeah... grammar wise this is super annoying, especially since const generic variables can be interspersed with type variables (afaik, or we haven't made any decision here...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we currently have the check duplicated (since the AST type loses the ordering requirement). @petrochenkov knows more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know anything anymore since @varkor rewrote everything for const generics.
Before that different kinds of parameters were stored separately in AST and the ordering was enforced solely in the parser.
grammar/generics.g
Outdated
|
||
ForAllBinder = "for" generics:Generics; | ||
|
||
WhereClause = "where" bounds:WhereBound* % "," ","?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows a bare ,
which I don't think is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, it's not valid! Hmm and neither is (,)
. I think I need to make a general note about this.
grammar/generics.g
Outdated
WhereClause = "where" bounds:WhereBound* % "," ","?; | ||
WhereBound = | ||
Lifetime:{ lt:LIFETIME ":" bounds:LifetimeBound* % "+" "+"? } | | ||
Type:{ binder:ForAllBinder? ty:Type ":" bounds:TypeBound* % "+" "+"? } | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this allows a bare +
which I don't think is allowed.
grammar/generics.g
Outdated
|
||
WhereClause = "where" bounds:WhereBound* % "," ","?; | ||
WhereBound = | ||
Lifetime:{ lt:LIFETIME ":" bounds:LifetimeBound* % "+" "+"? } | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this allows a bare +
which is not allowed?
@ehuss General note: this grammar doesn't include "restrictions" or any disambiguations in general. Practical declaractive constraints and disambiguations are an open question that this WG will have to come up with a solution for. The official implementation has often picked arbitrary constraints, in the name of being able to make decisions early, while complicating the formalism. |
General note about unstable features: I plan to actually split them out into a subdirectory and have a system for merging rules (I mean, We could even have one file per feature-gate, instead of duplicating the file structure of the stable grammar. |
@eddyb From the POV of language design (i.e. changing the grammar later) it seems more readable to me to have the feature gates integrated in the grammar where they naturally change things because feature gates sometimes affect multiple different parts of the grammar at the same time... I think a sort of pragma system might work well here, but for now we should just add comments saying that some alternation is unstable... |
@Centril I don't see how having them separate would hamper that? You just put them where they belong, and maybe simplify the definition to not include old special-cases. |
General note: I assumed optionally-trailing-separator separated lists were But now I'm not sure whether EDIT: @solson tried it out, and it looks like Rust is different than Perl6's EDIT2: I've been informed that the language Perl6 doesn't have lone commas (and some are speculating a bug might be letting |
@eddyb Maybe I'm not entirely understanding the suggestion... but I think having something like: ExprKind
= Literal:LITERAL
| Try:{ expr:Expr "?" }
| #[feature = "async"] -- invent concrete syntax corresponding to this...
AsyncBlock:{ "async" block:Block }
| #[feature = "try_block"]
TryBlock:{ "try" block:Block }
| #[feature = "generators"]
Yield:{ "yield" value:Expr? }
| Closure:{
#[feature = "generators"]
generator_static:"static"?
#[feature = "async"]
asyncness:"async"?
by_val:"move"? "|" args:ClosureArg* % "," ","? "|" { "->" ret_ty:Type }? body:Expr
}
; with the semantics that It's nice because you can see the context to which the gated alternation applies and the other things so you can understand this better. |
I would personally advocate higher order grammars and just writing |
What I mean is you'd have something like this in an ExprKind |=
AsyncBlock:{ "async" block:Block }; (not an imperative |
Oops, I forgot to mark this as WIP (so let's not land it for now). |
grammar/expr.g
Outdated
Index:{ base:Expr "[" index:Expr "]" } | | ||
Array:{ "[" attrs:InnerAttr* exprs:Expr* % "," ","? "]" } | | ||
Repeat:{ "[" attrs:InnerAttr* elem:Expr ";" count:Expr "]" } | | ||
Tuple:{ "(" attrs:InnerAttr* exprs:Expr* % "," ","? ")" } | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(EXPR)
is accepted as a tuple here (introducing false ambiguity).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Same with types and patterns.)
grammar/expr.g
Outdated
If = "if" cond:Cond then:Block { "else" else_expr:ElseExpr }?; | ||
Cond = | ||
Bool:Expr | | ||
Let:{ "let" pat:Pat "=" expr:Expr }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if let PAT | PAT = EXPR { ... }
is not covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might've been accepted/implemented after I've already opened this PR, heh.
So, I started reading this at last (now at Two issues are immediately noticeable:
|
The generated parser accepts a (lykenware/gll side note: how are unbalanced groups handled (error reported) in the TokenStream generator? Not relevant for us though.) List notation has been noted before, intent is to add a notation |
grammar/type.g
Outdated
binder:ForAllBinder? unsafety:"unsafe"? { "extern" abi:Abi }? | ||
"fn" "(" inputs:FnSigInputs? ","? ")" { "->" ret_ty:Type }? | ||
} | | ||
ImplTrait:{ "impl" bounds:TypeBound+ % "+" "+"? } | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, it's TypeBound*
!
#[cfg(FALSE)]
fn f() -> impl { 10 } // OK
Also true for DynTrait
if dyn
is present (at least on 2018 edition when dyn
is a proper keyword).
One thing I noticed is that the grammar is somewhat relaxed compared to what accepted by libsyntax parser (and libsyntax parser in general accept semantically invalid constructions if they can fit into AST - that forms the "language accepted under |
grammar/stmt.g
Outdated
Item:Item | | ||
Local:{ attrs:OuterAttr* "let" pat:Pat { ":" ty:Type }? { "=" init:Expr }? ";" } | | ||
Expr:Expr | | ||
Semi:";"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nonterminals in this grammar are supposed to match fragments in macro_rules
, then Stmt
needs to be organized slightly differently, more like
Stmt =
Item:Item |
Local:{ attrs:OuterAttr* "let" pat:Pat { ":" ty:Type }? { "=" init:Expr }? } |
Expr:Expr;
StmtX =
Item |
Local ";" |
Expr |
";";
Block = "{" attrs:InnerAttr* StmtX* "}";
Fix formatting + unstable notes due to ehuss
adopt .lyg as lykenware/gll's new file extension
in the spirit of improving visibility and making progress I'm going to fix the conflicts now and merge this PR. |
closing this as #17 is merged |
Tested at rust-lang/rust@f6e9a6e, after
git submodule update --recursive --init
:The "partial / prefix parse" bit pretty much means there's a syntax error somewhere within the file.
(click to see the exhaustive list of files that didn't parse)
AFAICT, none of those should parse, but I haven't double-checked against
libsyntax
yet.Also, it might be useful to do the reverse (to make sure we don't parse more than
rustc
).In order to get as much as possible to parse, I did include unstable syntax. However, we might be able to separate it out, or even just comment it out for now.
There are 3 blacklisted files, because they're unfeasibly large to parse currently:
libcore/unicode/tables.rs
issues/issue-29466.rs
issues/issue-29227.rs
I might end up using something similar to them to benchmark improvements to
lykenware/gll
.