The essence of lexer #59706

matklad · 2019-04-04T19:00:27Z

I would love to make a reusable library to lex Rust code, which could be used by rustc, rust-analyzer, proc-macros, etc. This draft PR is my attempt at the API. Currently, the PR uses the new lexer to lex comments and the shebang, while using the old lexer for everything else. This should be enough to agree on the API though!

High-level picture

A rust_lexer crate is introduced, with zero or minimal (for XID_Start and other unicode) dependencies. This crate basically exposes a single function: next_token(&str) -> (TokenKind, usize) which returns the first token of a non-empty string (usize is the length of the token). The main goal of the API is to be minimal. Non-strictly essential concerns, like string interning, are left to the clients.
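
For illustration, here is a minimal sketch of what a function of this shape could look like. The token kinds, the helper names (prefix_len, is_ident_start, is_ident_continue) and the ASCII-only identifier rule are assumptions made for the example, not the code in this PR.

```rust
/// Hypothetical token kinds; a real lexer would have many more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Whitespace,
    Ident,
    Number,
    /// Any single character this sketch does not classify.
    Unknown,
}

// Assumption: ASCII-only identifiers, to keep the sketch dependency-free.
fn is_ident_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

fn is_ident_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Byte length of the longest prefix of `s` whose chars all satisfy `pred`.
fn prefix_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.chars().take_while(|&c| pred(c)).map(char::len_utf8).sum()
}

/// Returns the kind and byte length of the first token of `input`.
/// Panics on empty input, mirroring the "non-empty string" precondition.
pub fn next_token(input: &str) -> (TokenKind, usize) {
    let first = input.chars().next().expect("input must be non-empty");
    let rest = &input[first.len_utf8()..];
    let (kind, rest_len) = if first.is_whitespace() {
        (TokenKind::Whitespace, prefix_len(rest, char::is_whitespace))
    } else if is_ident_start(first) {
        (TokenKind::Ident, prefix_len(rest, is_ident_continue))
    } else if first.is_ascii_digit() {
        (TokenKind::Number, prefix_len(rest, |c| c.is_ascii_digit()))
    } else {
        (TokenKind::Unknown, 0)
    };
    // The returned length is always at least one char, so the token is non-empty.
    (kind, first.len_utf8() + rest_len)
}
```

The function keeps no state between calls: the caller slices the input by the returned length and calls it again.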

Finer Points

Iterator API

We probably should expose a convenience function fn tokenize(&str) -> impl Iterator<Item = Token>

EDIT: I've added tokenize
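
A convenience iterator like this can be layered on top of the stateless function. The sketch below is only an illustration: the Token struct, the two token kinds and the toy next_token are assumptions, chosen to keep the example self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Whitespace,
    Word,
}

/// Hypothetical token: a kind plus its length in bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub len: usize,
}

/// Toy stateless lexer: a token is a maximal run of whitespace or of non-whitespace.
fn next_token(input: &str) -> (TokenKind, usize) {
    let first = input.chars().next().expect("input must be non-empty");
    let is_ws = first.is_whitespace();
    let len: usize = input
        .chars()
        .take_while(|c| c.is_whitespace() == is_ws)
        .map(char::len_utf8)
        .sum();
    let kind = if is_ws { TokenKind::Whitespace } else { TokenKind::Word };
    (kind, len)
}

/// Applies `next_token` repeatedly until the input is exhausted.
pub fn tokenize(mut input: &str) -> impl Iterator<Item = Token> + '_ {
    std::iter::from_fn(move || {
        if input.is_empty() {
            return None;
        }
        let (kind, len) = next_token(input);
        input = &input[len..]; // advance past the token just produced
        Some(Token { kind, len })
    })
}
```

Because every token is non-empty, the iterator always makes progress, and the token lengths sum to the length of the input.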

Error handling

The lexer itself provides only a minimal amount of error detection and reporting. Additionally, it never fatal-errors and always produces some non-empty token. Examples of errors detected by the lexer:

unterminated block comment
unterminated string literals

Examples of errors not detected by the lexer:

invalid escape sequence in a string literal
out of range integer literal
bare \r in the doc comment.

The idea is that the clients are responsible for additional validation of tokens. This is the mode an IDE operates in: you want to skip validation for library files, because you are not showing errors there anyway, and for user code, you want to do a deep validation with quick fixes and suggestions, which is not really a fit for the lexer itself.
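
To make the split concrete, here is a rough sketch for string literals. The lexer side only finds the end of the token and records whether it was terminated; a separate client-side pass checks escapes. The names StrToken, lex_str and invalid_escapes are hypothetical, and the accepted escape set is a simplified subset of Rust's.

```rust
/// Hypothetical lexer output for a string literal: length in bytes and
/// whether the closing quote was found. No other checks are made.
#[derive(Debug, PartialEq, Eq)]
pub struct StrToken {
    pub len: usize,
    pub terminated: bool,
}

/// Lexes a string literal that starts at the opening `"`.
/// Escapes are skipped over but never validated here.
pub fn lex_str(input: &str) -> StrToken {
    assert!(input.starts_with('"'));
    let mut chars = input.char_indices().skip(1);
    while let Some((i, c)) = chars.next() {
        match c {
            '"' => return StrToken { len: i + 1, terminated: true },
            '\\' => {
                chars.next(); // skip the escaped char, whatever it is
            }
            _ => {}
        }
    }
    // Unterminated: the token still covers the rest of the input.
    StrToken { len: input.len(), terminated: false }
}

/// Client-side validation: byte offsets (within the token text) of
/// backslashes that start an escape this sketch does not recognise.
pub fn invalid_escapes(token_text: &str) -> Vec<usize> {
    let mut errors = Vec::new();
    let mut chars = token_text.char_indices();
    while let Some((i, c)) = chars.next() {
        if c == '\\' {
            match chars.next() {
                // Simplified subset of escapes, an assumption of this sketch.
                Some((_, 'n' | 't' | 'r' | '\\' | '"' | '0')) => {}
                _ => errors.push(i),
            }
        }
    }
    errors
}
```

A client that does not show errors, such as an IDE scanning library files, would call only lex_str; a client that reports diagnostics would additionally run invalid_escapes on the token text.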

In particular, in this PR an unclosed /* comment is handled by the new lexer, while bare \r and the distinction between doc and non-doc comments are handled by the old lexer.

Performance

No attempt at performance measurement is made so far :) I think it is acceptable to regress perf here a bit in exchange for cleaner code, and I hope that regression wouldn't be too costly. In particular, because we validate tokens separately, we'll have to do one more pass for some of the tokens. I hope this is not a prohibitive cost. For example, for doc comments we already do two passes (lexing + interning), so adding a third one shouldn't be that much slower (and we also do an additional pass for utf-8 validation). And lexing is hopefully not a bottleneck. Note that for IDEs separate validation might actually improve performance, because we will be able to skip validation when, for example, computing completions.

Long term, I hope that this approach will allow for better performance. If we separate pure lexing, in the future we can code-gen a super-optimized state machine that walks utf-8 directly, instead of the current manual char-by-char toil.

Cursor API

For implementation, I am going slightly unconventionally. Instead of defining a Lexer struct with a bunch of helper methods (current, bump) and a bunch of lexing methods (lex_comment, lex_whitespace), I define a Cursor struct which has only helpers, and define a top-level function with a &mut Cursor argument for each grammar production. I find this C-style more readable for parsers and lexers.

EDIT: switched to a more conventional setup with lexing methods
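
For reference, the original Cursor-plus-free-functions style described above could be sketched roughly as follows. The field names and the len_consumed helper are assumptions for the example; only current, bump, lex_whitespace and the comment production are named in the description.

```rust
use std::str::Chars;

/// The cursor holds only lookahead/advance helpers and knows nothing
/// about the grammar.
pub struct Cursor<'a> {
    initial_len: usize,
    chars: Chars<'a>,
}

impl<'a> Cursor<'a> {
    pub fn new(input: &'a str) -> Self {
        Cursor { initial_len: input.len(), chars: input.chars() }
    }

    /// Peeks at the next char without consuming it.
    pub fn current(&self) -> Option<char> {
        self.chars.clone().next()
    }

    /// Consumes one char.
    pub fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }

    /// Number of bytes consumed so far (hypothetical helper).
    pub fn len_consumed(&self) -> usize {
        self.initial_len - self.chars.as_str().len()
    }
}

/// One free function per grammar production, taking `&mut Cursor`.
pub fn lex_whitespace(cursor: &mut Cursor) {
    while cursor.current().map_or(false, char::is_whitespace) {
        cursor.bump();
    }
}

/// Line comment: consumes up to, but not including, the newline.
pub fn lex_line_comment(cursor: &mut Cursor) {
    while cursor.current().map_or(false, |c| c != '\n') {
        cursor.bump();
    }
}
```

In the more conventional setup, lex_whitespace and lex_line_comment would simply become methods on the same struct; the helpers stay unchanged.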

So, what do folks think about this?

rust-highfive · 2019-04-04T19:00:37Z

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

src/libsyntax/parse/lexer/mod.rs

alexcrichton · 2019-04-05T13:28:38Z

I'm personally pretty unfamiliar with this work, but @matklad do you know who'd be good to review this?

matklad · 2019-04-05T13:31:12Z

That is a good question. Perhaps @eddyb or @petrochenkov? I also feel that maybe this needs to be tagged with T-compiler and discussed more generally?

alexcrichton · 2019-04-05T13:32:44Z

r? @petrochenkov

bors · 2019-04-05T15:11:37Z

☔ The latest upstream changes (presumably #59721) made this pull request unmergeable. Please resolve the merge conflicts.

petrochenkov · 2019-04-07T22:35:00Z

Since I was assigned, here are my priorities:

High priority: the lexer code (both interfaces and implementation) can be tweaked at any time for performance or other reasons (this means zero stability guarantees) without infrastructural hurdles (no separate repos, submodule updates, crate version changes).
Lower priority: reuse of the lexer code with other projects.

So, if the lexer crate follows the model of rustc-ap-syntax, then I'm happy.
(It should probably be named librustc_lexer rather than rust_lexer in that case.)

If the first priority is satisfied, then I'm not even too interested in discussing the exact interface of the proposed reusable lexer - it could be improved at any time if some usability or performance issues are found.
Frankly, I have no idea how the perfect reusable lexer interface should look, I never wrote a whole lexer and don't know the requirements.
What this PR does seems fine for a start.

Reassigning to someone who can into high-level design.

matklad · 2019-04-08T08:05:58Z

Thanks!

I agree that this should be just a usual library in the rust monorepo, and that it shouldn't have any compatibility guarantees. As a stretch goal, I'd love to additionally make sure that just cargo test inside the librustc_lexer's dir works. This would help with:

lexer-specific test suite: now, to test my changes, I need to build the rest of the compiler, b/c some bits are only covered by run-pass, and that is slow
a second "specification" implementation (a bunch of regexes + special casing /* and r#") which is compared with the production one and used in the language reference (this is something @rust-lang/wg-grammar might be interested in).

The hard requirement for me though is building on stable. This is different from the ap-syntax model, which is nightly only. I hope it'll "just work", the interface seems pretty minimal (although various unicode tables in libcore might be a problem). At worst, we can have a feature-flag in the crate to enable rustc_private stuff.

eddyb · 2019-04-16T10:55:43Z

One concern I have is that the API of the old lexer kind of preceded external iterators.
That may sound strange, but, AFAIK, the lexer Reader may literally be older than Iterator.

So that said, I think we should have one or two of:

1. a stateless "match token at start of string" API
2. a stateful Iterator that applies 1. repeatedly

What I don't think we should have is anything resembling the current API, which is stateful but at the same time it

I also agree with @petrochenkov that rustc_lex(er) (or, IMO, syntax_lex(er)) are better names.

src/rust_lexer/Cargo.lock

src/rust_lexer/Cargo.toml

src/rust_lexer/src/lib.rs

matklad · 2019-04-16T11:27:11Z

Thanks for the review @eddyb! Given the general thumbsup here, I'll work on this in the coming weeks to make this production ready!

So that said, I think we should have one or two of:

I think we should do both: stateless one is less powerful (you can't lex python-style f-strings with it), so, while rust lexical grammar admits stateless lexing, we should use it. Stateless is also good for incremental relexing. For the users though, iterator API on top of stateless API would be preferable.

I also plan to initially preserve the API of the current code in libsyntax exactly (by proxying to the new crate), and do simplification refactoring in a separate PR.

Now that I think about it, is_beginning_of_file is only used for shebangs, right?

Yeah, I was debating about what to do with shebangs as well... Part of me wants to say "nah, this is implementation defined concern", and just don't handle it in this library. Your proposal of a separate fn strip_shebang is nice though: we both keep the core interface clean, but handle shebangs in an implementation-independent way.
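
A separate helper of that kind could look roughly like the sketch below. The signature and the exact rule are assumptions for illustration: here `#!` at the very start counts as a shebang unless it is immediately followed by `[`, which would instead start an inner attribute.

```rust
/// Hypothetical helper: returns the byte length of a leading shebang line,
/// or `None` if the input does not start with one.
pub fn strip_shebang(input: &str) -> Option<usize> {
    // Assumption: `#![` starts an inner attribute such as `#![allow(..)]`,
    // so it must not be treated as a shebang.
    if input.starts_with("#!") && !input.starts_with("#![") {
        // The shebang runs to the end of the first line, newline excluded.
        Some(input.find('\n').unwrap_or(input.len()))
    } else {
        None
    }
}
```

A caller would run this once on the whole file, skip the returned number of bytes, and hand the rest to the tokenizer, so the core token function never needs to know whether it is at the beginning of the file.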

That would make this a &str -> Option, which is very close to FromStr!

Is it OK for FromStr to consume only part of the input though? I guess, if the public API is fn tokenize(src: &str) -> impl Iterator<Item = Token> this doesn't even really matter.

I think it would be nicer if these were methods.

Heh, for me personally, free-standing functions for grammar productions and methods for lookahead/bump work much better, but, even if this approach is objectively better, it still makes sense to go with methods to minimize exoticism. Will fix that!

bors · 2019-07-21T10:58:52Z

☀️ Test successful - checks-azure
Approved by: petrochenkov
Pushing 83dfe7b to master...

matklad · 2019-07-21T11:17:14Z

🎉 hopefully, this is the last lexer for Rust :)

eddyb · 2019-07-21T11:25:24Z

@matklad Would be cool to share it between the compiler and proc-macro2, as well!
cc @dtolnay

@matklad

Use unicode-xid crate instead of libcore

This PR proposes to remove the char::is_xid_start and char::is_xid_continue functions from libcore and use the unicode_xid crate from crates.io (note that this crate is already present in rust-lang/rust's Cargo.lock).

Reasons to do this:

removing rustc-binary-specific stuff from libcore
making sure that, across the ecosystem, there's a single definition of what a rust identifier is (unicode-xid has almost 10 million downloads, as a proc_macro2 dependency)
making it easier to share the rustc_lexer crate with rust-analyzer: no need to #[cfg] if we are building as a part of the compiler

Reasons not to do this:

increased maintenance burden: we'll need to upgrade the unicode version both in libcore and in unicode-xid. However, this shouldn't be too heavy a burden: just running ./unicode.py after a new unicode version. I (@matklad) am ready to be a t-compiler side maintainer of unicode-xid. Moreover, given that unicode-xid is an important dependency of syn, someone needs to maintain it anyway.
the unicode-xid implementation is significantly slower. It uses a more compact table with binary search, instead of a trie. However, this shouldn't matter in practice, because we have a fast path for ascii anyway, and the code size savings are a plus. Moreover, in rust-lang#59706 not using libcore turned out to be faster, presumably because checking for whitespace with match is even faster.

old description

Followup to rust-lang#59706
r? @eddyb
Note that this doesn't actually remove tables from libcore, to avoid conflict with rust-lang#62641.
cc unicode-rs/unicode-xid#11
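
The "ASCII fast path, then binary search over a compact range table" idea mentioned above can be sketched as follows. The table is a tiny hand-picked excerpt for illustration, not the real XID_Start data, and is_ident_start is a hypothetical name rather than the unicode-xid API.

```rust
use std::cmp::Ordering;

/// Illustrative and deliberately incomplete: sorted, non-overlapping,
/// inclusive ranges of non-ASCII chars accepted as identifier starts.
const START_RANGES: &[(char, char)] = &[
    ('\u{00C0}', '\u{00D6}'),
    ('\u{00D8}', '\u{00F6}'),
    ('\u{0391}', '\u{03A1}'),
    ('\u{0410}', '\u{044F}'),
];

pub fn is_ident_start(c: char) -> bool {
    // Fast path: most source text is ASCII, so the table is never touched.
    if c.is_ascii() {
        return c.is_ascii_alphabetic() || c == '_';
    }
    // Slow path: binary search for the range containing `c`.
    START_RANGES
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                Ordering::Greater // this range lies above `c`
            } else if c > hi {
                Ordering::Less // this range lies below `c`
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```

With this layout the cost of the slower lookup is paid only for non-ASCII characters, which is why the table representation is unlikely to matter much for typical source files.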

matklad · 2019-09-05T20:08:09Z

published as https://crates.io/crates/rustc_lexer

eddyb · 2019-09-06T15:16:15Z

@matklad Maybe we should move it out of tree if we publish it on crates.io?

cc @rust-lang/compiler I'm not sure if we have a clear policy on this.

Centril · 2019-09-06T15:20:40Z

I'm opposed to moving important parts of the compiler out of tree because it becomes impractical to review changes to them from a language team perspective. I don't want to have to check what a bump in a PR that just updates Cargo.lock means in terms of semantic changes to the language.

matklad · 2019-09-06T15:53:15Z

I feel like we need a dedicated discussion/RFC to figure out how to organize libraries in the librarified world. At the moment, I am content with the status quo, and I don't feel like we should change anything right now. Rather, we should ask this question when something bigger, like chalk, matures.

Long term, I personally would prefer a monorepo setup, but one where just cargo test --package thing-I-am-hacking is enough for most testing and where ./x.py test is something that mostly only bors executes. That is, I don't see a problem in things being in the same source tree, I see a problem with building the whole tree to get the basic testing of a single component.

eddyb · 2019-09-06T16:54:39Z

FWIW I would like ./x.py test --stage 0 src/librustc_lexer to "just" work.

@Mark-Simulacrum has some ideas about being able to use the most recent CI build artifact to avoid bootstrapping, not sure if they're required or not.

Mark-Simulacrum · 2019-09-06T16:59:27Z

IMO, if that doesn't work today, it's probably a bug. Furthermore, I would expect crates that are decoupled from the compiler (i.e., don't depend on unstable details from libstd and such) to work via cargo test --manifest-path ... as well, if you want to use fully "native" Cargo.

matklad · 2019-09-06T17:13:14Z

FWIW I would like ./x.py test --stage 0 src/librustc_lexer to "just" work.

It works, but only after going through the whole bootstrapping process.

Furthermore, I would expect crates that are decoupled from the compiler (i.e., don't depend on unstable details from libstd and such) to work via cargo test --manifest-path ... as well, if you want to use fully "native" Cargo.

Wow, this indeed just works for rustc_lexer, now that #62848 is merged. It's not maximally useful at the moment, as there are few lexer specific tests, but hopefully rust-lang/wg-grammar#3 will fix that.

I guess that I am now happy with the current setup as the long-term setup :) The only minor thing which may be worth doing, if we are to pursue librarization seriously, is to move clean, bootstrap-independent librarified components from /src/libfoo to /crates/foo: the current src directory looks a little like a kitchen sink.

rust-highfive assigned alexcrichton Apr 4, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 4, 2019

alexcrichton added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2019

rust-highfive assigned petrochenkov and unassigned alexcrichton Apr 5, 2019

petrochenkov assigned Zoxc and eddyb and unassigned petrochenkov Apr 7, 2019

rust-highfive assigned petrochenkov and unassigned Zoxc and eddyb Apr 15, 2019

petrochenkov assigned Zoxc and eddyb and unassigned petrochenkov Apr 15, 2019