-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The essence of lexer #59706
The essence of lexer #59706
Conversation
(rust_highfive has picked a reviewer for you, use r? to override) |
I'm personally pretty unfamiliar with this work, but @matklad do you know who'd be good to review this? |
That is a good question. Perhaps @eddyb or @petrochenkov? I also feel that maybe this needs to be tagged with T-compiler and discussed more generally? |
☔ The latest upstream changes (presumably #59721) made this pull request unmergeable. Please resolve the merge conflicts. |
Since I was assigned, here are my priorities:
So, if the lexer crate follows the model of rustc-ap-syntax, then I'm happy. If the first priority is satisfied, then I'm not even too interested in discussing the exact interface of the proposed reusable lexer - it could be improved at any time if some usability or performance issues are found. Reassigning to someone who can into high-level design. |
Thanks! I agree that this should be just a usual library in the rust monorepo, and that it shouldn't have any compatibility guarantees. As a stretch goal, I'd love to additionally make sure that just
The hard requirement for me though is building on stable. This is different from ap-syntax model, which is nightly only. I hope it'll "just work", the interface seems pretty minimal (although various unicode tables in libcore might be a problem). At worst, we can have a feature-flag in the create to enable rustc_private stuff. |
This comment has been minimized.
This comment has been minimized.
One concern I have is that the API of the old lexer kind of preceded external iterators. So that said, I think we should have one or two of:
What I don't we should have is anything resembling the current API, which is stateful but at the same time it I also agree with @petrochenkov that |
Thanks for the review @eddyb! Given the general thumbsup here, I'll work on this in the coming weeks to make this production ready!
I think we should do both: stateless one is less powerful (you can't lex python-style f-strings with it), so, while rust lexical grammar admits stateless lexing, we should use it. Stateless is also good for incremental relexing. For the users though, iterator API on top of stateless API would be preferable. I also plan to initially preserve the API of the current code in libsyntax exactly (by proxiing to the new crate), and do simplification refactoring in a separate PR.
Yeah, I was debating about what to do with shebangs as well... Part of me wants to say "nah, this is implementation defined concern", and just don't handle it in this library. Your proposal of a separate
Is it OK for
Heh, for me personally, free-standing functions for grammar productions and methods for lookahead/bump work much better, but, even if this approach is objectively better, it's still makes sense to go with methods to minimize exoticism. Will fix that! |
The essence of lexer cc @eddyb I would love to make a reusable library to lex rust code, which could be used by rustc, rust-analyzer, proc-macros, etc. This **draft** PR is my attempt at the API. Currently, the PR uses new lexer to lex comments and shebang, while using the old lexer for everything else. This should be enough to agree on the API though! ### High-level picture An `rust_lexer` crate is introduced, with zero or minimal (for XID_Start and other unicode) dependencies. This crate basically exposes a single function: `next_token(&str) -> (TokenKind, usize)` which returns the first token of a non-empty string (`usize` is the length of the token). The main goal of the API is to be minimal. Non-strictly essential concerns, like string interning, are left to the clients. ### Finer Points #### Iterator API We probably should expose a convenience function `fn tokenize(&str) -> impl Iterator<Item = Token>` EDIT: I've added `tokenize` #### Error handling The lexer itself provides only minimal amount of error detection and reporting. Additionally, it never fatal-errors and always produces some non-empty token. Examples of errors detected by the lexer: * unterminated block comment * unterminated string literals Example of errors **not** detected by the lexer: * invalid escape sequence in a string literal * out of range integer literal * bare `\r` in the doc comment. The idea is that the clients are responsible for additional validation of tokens. This is the mode IDE operates in: you want to skip validation for library files, because you are not showing errors there anyway, and for user-code, you want to do a deep validation with quick fixes and suggestions, which is not really fit for the lexer itself. In particular, in this PR unclosed `/*` comment is handled by the new lexer, bare `\r` and distinction between doc and non-doc comments is handled by the old lexer. #### Performance No attempt at performance measurement is made so far :) I think it is acceptable to regress perf here a bit in exchange for cleaner code, and I hope that regression wouldn't be too costly. In particular, because we validate tokens separately, we'll have to do one more pass for some of the tokens. I hope this is not a prohibitive cost. For example, for doc comments we already do two passes (lexing + interning), so adding a third one shouldn't be that much slower (and we also do an additional pass for utf-8 validation). And lexing is hopefully not a bottleneck. Note that for IDEs separate validation might actually improve performance, because we will be able to skip validation when, for example, computing completions. Long term, I hope that this approach will allow for *better* performance. If we separate pure lexing, in the future we can code-gen super-optimizes state machine that walks utf-8 directly, instead of current manual char-by-char toil. #### Cursor API For implementation, I am going slightly unconventionally. Instead of defining a `Lexer` struct with a bunch of helper methods (`current`, `bump`) and a bunch of lexing methods (`lex_comment`, `lex_whitespace`), I define a `Cursor` struct which has only helpers, and define a top-level function with a `&mut Cursor` argument for each grammar production. I find this C-style more readable for parsers and lexers. EDIT: swithced to a more conventional setup with lexing methods So, what do folks think about this?
☀️ Test successful - checks-azure |
Use unicode-xid crate instead of libcore This PR proposes to remove `char::is_xid_start` and `char::is_xid_continue` functions from `libcore` and use `unicode_xid` crate from crates.io (note that this crate is already present in rust-lang/rust's Cargo.lock). Reasons to do this: * removing rustc-binary-specific stuff from libcore * making sure that, across the ecosystem, there's a single definition of what rust identifier is (`unicode-xid` has almost 10 million downs, as a `proc_macro2` dependency) * making it easier to share `rustc_lexer` crate with rust-analyzer: no need to `#[cfg]` if we are building as a part of the compiler Reasons not to do this: * increased maintenance burden: we'll need to upgrade unicode version both in libcore and in unicode-xid. However, this shouldn't be a too heavy burden: just running `./unicode.py` after new unicode version. I (@matklad) am ready to be a t-compiler side maintainer of unicode-xid. Moreover, given that xid-unicode is an important dependency of syn, *someone* needs to maintain it anyway. * xid-unicode implementation is significantly slower. It uses a more compact table with binary search, instead of a trie. However, this shouldn't matter in practice, because we have fast-path for ascii anyway, and code size savings is a plus. Moreover, in rust-lang#59706 not using libcore turned out to be *faster*, presumably beacause checking for whitespace with match is even faster. <details> <summary>old description</summary> Followup to rust-lang#59706 r? @eddyb Note that this doesn't actually remove tables from libcore, to avoid conflict with rust-lang#62641. cc unicode-rs/unicode-xid#11 </details>
Use unicode-xid crate instead of libcore This PR proposes to remove `char::is_xid_start` and `char::is_xid_continue` functions from `libcore` and use `unicode_xid` crate from crates.io (note that this crate is already present in rust-lang/rust's Cargo.lock). Reasons to do this: * removing rustc-binary-specific stuff from libcore * making sure that, across the ecosystem, there's a single definition of what rust identifier is (`unicode-xid` has almost 10 million downs, as a `proc_macro2` dependency) * making it easier to share `rustc_lexer` crate with rust-analyzer: no need to `#[cfg]` if we are building as a part of the compiler Reasons not to do this: * increased maintenance burden: we'll need to upgrade unicode version both in libcore and in unicode-xid. However, this shouldn't be a too heavy burden: just running `./unicode.py` after new unicode version. I (@matklad) am ready to be a t-compiler side maintainer of unicode-xid. Moreover, given that xid-unicode is an important dependency of syn, *someone* needs to maintain it anyway. * xid-unicode implementation is significantly slower. It uses a more compact table with binary search, instead of a trie. However, this shouldn't matter in practice, because we have fast-path for ascii anyway, and code size savings is a plus. Moreover, in rust-lang#59706 not using libcore turned out to be *faster*, presumably beacause checking for whitespace with match is even faster. <details> <summary>old description</summary> Followup to rust-lang#59706 r? @eddyb Note that this doesn't actually remove tables from libcore, to avoid conflict with rust-lang#62641. cc unicode-rs/unicode-xid#11 </details>
Use unicode-xid crate instead of libcore This PR proposes to remove `char::is_xid_start` and `char::is_xid_continue` functions from `libcore` and use `unicode_xid` crate from crates.io (note that this crate is already present in rust-lang/rust's Cargo.lock). Reasons to do this: * removing rustc-binary-specific stuff from libcore * making sure that, across the ecosystem, there's a single definition of what rust identifier is (`unicode-xid` has almost 10 million downs, as a `proc_macro2` dependency) * making it easier to share `rustc_lexer` crate with rust-analyzer: no need to `#[cfg]` if we are building as a part of the compiler Reasons not to do this: * increased maintenance burden: we'll need to upgrade unicode version both in libcore and in unicode-xid. However, this shouldn't be a too heavy burden: just running `./unicode.py` after new unicode version. I (@matklad) am ready to be a t-compiler side maintainer of unicode-xid. Moreover, given that xid-unicode is an important dependency of syn, *someone* needs to maintain it anyway. * xid-unicode implementation is significantly slower. It uses a more compact table with binary search, instead of a trie. However, this shouldn't matter in practice, because we have fast-path for ascii anyway, and code size savings is a plus. Moreover, in rust-lang#59706 not using libcore turned out to be *faster*, presumably beacause checking for whitespace with match is even faster. <details> <summary>old description</summary> Followup to rust-lang#59706 r? @eddyb Note that this doesn't actually remove tables from libcore, to avoid conflict with rust-lang#62641. cc unicode-rs/unicode-xid#11 </details>
published as https://crates.io/crates/rustc_lexer |
@matklad Maybe we should move it out of tree if we publish it on cc @rust-lang/compiler I'm not sure if we have a clear policy on this. |
I'm opposed to moving important parts of the compiler out of tree because it becomes impractical to review changes to them from a language team perspective. I don't want to have to have to check what a bump in a PR that just updates |
I feel like we need a dedicated discussion/RFC to figure out how to organize libraries in the librariified world. At the moment, I am content with the status quo, and I don't feel like we should change anything right now. Rather, we should ask this question when something bigger, like chalk, matures. Long term, I personally would prefer a monorepo setup, but one where just |
FWIW I would like @Mark-Simulacrum has some ideas about being able to use the most recent CI build artifact to avoid bootstrapping, not sure if they're required or not. |
IMO, if that doesn't work today, it's probably a bug. Furthermore, I would expect crates that are decoupled from the compiler (i.e., don't depend on unstable details from libstd and such) to work via |
It works, but only after going via the whole bootstraping process.
Wow, this indeed just works for rustc_lexer, now that #62848 is merged. It's not maximally useful at the moment, as there are few lexer specific tests, but hopefully rust-lang/wg-grammar#3 will fix that. I guess that I am now happy with the current setup as the long-term setup :) The only minor thing which maybe worth doing, if we are to pursue librarization seriously, is to move clean, bootstrap-independent librarified components from |
cc @eddyb
I would love to make a reusable library to lex rust code, which could be used by rustc, rust-analyzer, proc-macros, etc. This draft PR is my attempt at the API. Currently, the PR uses new lexer to lex comments and shebang, while using the old lexer for everything else. This should be enough to agree on the API though!
High-level picture
An
rust_lexer
crate is introduced, with zero or minimal (for XID_Start and other unicode) dependencies. This crate basically exposes a single function:next_token(&str) -> (TokenKind, usize)
which returns the first token of a non-empty string (usize
is the length of the token). The main goal of the API is to be minimal. Non-strictly essential concerns, like string interning, are left to the clients.Finer Points
Iterator API
We probably should expose a convenience function
fn tokenize(&str) -> impl Iterator<Item = Token>
EDIT: I've added
tokenize
Error handling
The lexer itself provides only minimal amount of error detection and reporting. Additionally, it never fatal-errors and always produces some non-empty token. Examples of errors detected by the lexer:
Example of errors not detected by the lexer:
\r
in the doc comment.The idea is that the clients are responsible for additional validation of tokens. This is the mode IDE operates in: you want to skip validation for library files, because you are not showing errors there anyway, and for user-code, you want to do a deep validation with quick fixes and suggestions, which is not really fit for the lexer itself.
In particular, in this PR unclosed
/*
comment is handled by the new lexer, bare\r
and distinction between doc and non-doc comments is handled by the old lexer.Performance
No attempt at performance measurement is made so far :) I think it is acceptable to regress perf here a bit in exchange for cleaner code, and I hope that regression wouldn't be too costly. In particular, because we validate tokens separately, we'll have to do one more pass for some of the tokens. I hope this is not a prohibitive cost. For example, for doc comments we already do two passes (lexing + interning), so adding a third one shouldn't be that much slower (and we also do an additional pass for utf-8 validation). And lexing is hopefully not a bottleneck. Note that for IDEs separate validation might actually improve performance, because we will be able to skip validation when, for example, computing completions.
Long term, I hope that this approach will allow for better performance. If we separate pure lexing, in the future we can code-gen super-optimizes state machine that walks utf-8 directly, instead of current manual char-by-char toil.
Cursor API
For implementation, I am going slightly unconventionally. Instead of defining a
Lexer
struct with a bunch of helper methods (current
,bump
) and a bunch of lexing methods (lex_comment
,lex_whitespace
), I define aCursor
struct which has only helpers, and define a top-level function with a&mut Cursor
argument for each grammar production. I find this C-style more readable for parsers and lexers.EDIT: swithced to a more conventional setup with lexing methods
So, what do folks think about this?