Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize ident #66670

Merged
merged 3 commits into from
Dec 26, 2019
Merged

Normalize ident #66670

merged 3 commits into from
Dec 26, 2019

Conversation

crlf0710
Copy link
Member

Perform unicode normalization on identifiers. Resolving the first bullet point in #55467.

@rust-highfive
Copy link
Collaborator

r? @davidtwco

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 23, 2019
@crlf0710
Copy link
Member Author

r? @estebank
And i think will hurt performance a little. Maybe we should measure it before it lands.

@rust-highfive rust-highfive assigned estebank and unassigned davidtwco Nov 23, 2019
@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
2019-11-23T14:44:27.8437353Z ##[command]git remote add origin https://github.com/rust-lang/rust
2019-11-23T14:44:27.8453281Z ##[command]git config gc.auto 0
2019-11-23T14:44:27.8499331Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2019-11-23T14:44:27.8540359Z ##[command]git config --get-all http.proxy
2019-11-23T14:44:27.8663581Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/66670/merge:refs/remotes/pull/66670/merge
---
2019-11-23T15:33:31.2561630Z .................................................................................................... 1600/9281
2019-11-23T15:33:35.4769065Z .................................................................................................... 1700/9281
2019-11-23T15:33:46.6108886Z ...........................i........................................................................ 1800/9281
2019-11-23T15:33:52.2225717Z .................................................................................................... 1900/9281
2019-11-23T15:34:04.2520439Z ............iiiii................................................................................... 2000/9281
2019-11-23T15:34:12.2595071Z .................................................................................................... 2200/9281
2019-11-23T15:34:14.7292704Z .................................................................................................... 2300/9281
2019-11-23T15:34:19.2892354Z .................................................................................................... 2400/9281
2019-11-23T15:34:36.5565960Z .................................................................................................... 2500/9281
---
2019-11-23T15:36:58.2344464Z ............i...............i....................................................................... 4800/9281
2019-11-23T15:37:06.6957700Z .................................................................................................... 4900/9281
2019-11-23T15:37:11.5742193Z .................................................................................................... 5000/9281
2019-11-23T15:37:19.7502276Z .................................................................................................... 5100/9281
2019-11-23T15:37:25.2078253Z .................ii.ii...........i.................................................................. 5200/9281
2019-11-23T15:37:33.5455009Z .................................................................................................... 5400/9281
2019-11-23T15:37:43.2590696Z ...................................................................................................i 5500/9281
2019-11-23T15:37:50.5966007Z .................................................................................................... 5600/9281
2019-11-23T15:37:55.9819282Z .................................................................................................... 5700/9281
2019-11-23T15:37:55.9819282Z .................................................................................................... 5700/9281
2019-11-23T15:38:05.3697115Z .....................................................................................ii...i..ii..... 5800/9281
2019-11-23T15:38:25.1355922Z .................................................................................................... 6000/9281
2019-11-23T15:38:31.8522220Z .................................................................................................... 6100/9281
2019-11-23T15:38:35.6726851Z .................................................................................................... 6200/9281
2019-11-23T15:38:35.6726851Z .................................................................................................... 6200/9281
2019-11-23T15:38:46.8930340Z ........i..ii....................................................................................... 6300/9281
2019-11-23T15:39:02.3584974Z ............................................................................i....................... 6500/9281
2019-11-23T15:39:04.4406845Z .................................................................................................... 6600/9281
2019-11-23T15:39:06.3849978Z ...................................................................i................................ 6700/9281
2019-11-23T15:39:08.8141806Z .................................................................................................... 6800/9281
---
2019-11-23T15:43:05.8447433Z ---- [ui] ui/codemap_tests/unicode_2.rs stdout ----
2019-11-23T15:43:05.8447480Z diff of stderr:
2019-11-23T15:43:05.8447527Z 
2019-11-23T15:43:05.8447558Z 14    |
2019-11-23T15:43:05.8447593Z 15    = help: valid widths are 8, 16, 32, 64 and 128
2019-11-23T15:43:05.8447831Z - error[E0425]: cannot find value `a̐é` in this scope
2019-11-23T15:43:05.8448008Z + error[E0425]: cannot find value `a̐é` in this scope
2019-11-23T15:43:05.8448184Z 18   --> $DIR/unicode_2.rs:6:13
2019-11-23T15:43:05.8448235Z 19    |
2019-11-23T15:43:05.8448235Z 19    |
2019-11-23T15:43:05.8448382Z 20 LL |     let _ = a̐é;
2019-11-23T15:43:05.8448404Z 
2019-11-23T15:43:05.8448422Z 
2019-11-23T15:43:05.8448471Z The actual stderr differed from the expected stderr.
2019-11-23T15:43:05.8448700Z Actual stderr saved to /checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/codemap_tests/unicode_2/unicode_2.stderr
2019-11-23T15:43:05.8448887Z To update references, rerun the tests and pass the `--bless` flag
2019-11-23T15:43:05.8449099Z To only update this specific test, also pass `--test-args codemap_tests/unicode_2.rs`
2019-11-23T15:43:05.8449157Z error: 1 errors occurred comparing output.
2019-11-23T15:43:05.8449191Z status: exit code: 1
2019-11-23T15:43:05.8449191Z status: exit code: 1
2019-11-23T15:43:05.8449899Z command: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "/checkout/src/test/ui/codemap_tests/unicode_2.rs" "-Zthreads=1" "--target=x86_64-unknown-linux-gnu" "--error-format" "json" "-Zui-testing" "-C" "prefer-dynamic" "--out-dir" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/codemap_tests/unicode_2" "-Crpath" "-O" "-Cdebuginfo=0" "-Zunstable-options" "-Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "-L" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/codemap_tests/unicode_2/auxiliary" "-A" "unused"
2019-11-23T15:43:05.8450207Z ------------------------------------------
2019-11-23T15:43:05.8450233Z 
2019-11-23T15:43:05.8450392Z ------------------------------------------
2019-11-23T15:43:05.8450441Z stderr:
2019-11-23T15:43:05.8450441Z stderr:
2019-11-23T15:43:05.8450782Z ------------------------------------------
2019-11-23T15:43:05.8451221Z error: invalid width `7` for integer literal
2019-11-23T15:43:05.8451586Z    |
2019-11-23T15:43:05.8451586Z    |
2019-11-23T15:43:05.8451829Z LL |     let _ = ("a̐éö̲", 0u7); //~ ERROR invalid width
2019-11-23T15:43:05.8451955Z    |
2019-11-23T15:43:05.8451955Z    |
2019-11-23T15:43:05.8452001Z    = help: valid widths are 8, 16, 32, 64 and 128
2019-11-23T15:43:05.8452034Z 
2019-11-23T15:43:05.8452098Z error: invalid width `42` for integer literal
2019-11-23T15:43:05.8452425Z    |
2019-11-23T15:43:05.8452425Z    |
2019-11-23T15:43:05.8452713Z LL |     let _ = ("아あ", 1i42); //~ ERROR invalid width
2019-11-23T15:43:05.8452809Z    |
2019-11-23T15:43:05.8452809Z    |
2019-11-23T15:43:05.8452876Z    = help: valid widths are 8, 16, 32, 64 and 128
2019-11-23T15:43:05.8453165Z error[E0425]: cannot find value `a̐é` in this scope
2019-11-23T15:43:05.8453424Z   --> /checkout/src/test/ui/codemap_tests/unicode_2.rs:6:13
2019-11-23T15:43:05.8453489Z    |
2019-11-23T15:43:05.8453725Z LL |     let _ = a̐é; //~ ERROR cannot find
---
2019-11-23T15:43:05.8479568Z thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:537:22
2019-11-23T15:43:05.8479634Z note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
2019-11-23T15:43:05.8491102Z 
2019-11-23T15:43:05.8491212Z 
2019-11-23T15:43:05.8494428Z command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/ui" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--mode" "ui" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-6.0/bin/FileCheck" "--host-rustcflags" "-Crpath -O -Cdebuginfo=0 -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Cdebuginfo=0 -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--quiet" "--llvm-version" "6.0.0\n" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
2019-11-23T15:43:05.8494915Z 
2019-11-23T15:43:05.8495070Z 
2019-11-23T15:43:05.8504576Z failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
2019-11-23T15:43:05.8504636Z Build completed unsuccessfully in 0:53:08
2019-11-23T15:43:05.8504636Z Build completed unsuccessfully in 0:53:08
2019-11-23T15:43:05.8548820Z == clock drift check ==
2019-11-23T15:43:05.8560947Z   local time: Sat Nov 23 15:43:05 UTC 2019
2019-11-23T15:43:06.3766966Z   network time: Sat, 23 Nov 2019 15:43:06 GMT
2019-11-23T15:43:06.3770366Z == end clock drift check ==
2019-11-23T15:43:07.2796943Z 
2019-11-23T15:43:07.2877973Z ##[error]Bash exited with code '1'.
2019-11-23T15:43:07.2907889Z ##[section]Starting: Checkout
2019-11-23T15:43:07.2909365Z ==============================================================================
2019-11-23T15:43:07.2909406Z Task         : Get sources
2019-11-23T15:43:07.2909441Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@Centril Centril added the F-non_ascii_idents `#![feature(non_ascii_idents)]` label Nov 23, 2019
@Centril
Copy link
Contributor

Centril commented Nov 23, 2019

cc @Manishearth

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm from a unicode standpoint

hopefully there are no other places where we create Symbols?

@Centril
Copy link
Contributor

Centril commented Nov 23, 2019

cc @petrochenkov re. Symbols

@crlf0710
Copy link
Member Author

hopefully there are no other places where we create Symbols?

At least there's one in the proc_macro crate. I looked at the code but not sure how should i change it. I know nothing about the "bridge" thing. I'm leaving that into a further PR.

@estebank
Copy link
Contributor

@Manishearth shouldn't we have a warning when encountering non-NFC chars? Shouldn't we, for the sake of third party tools, attempt to "soft-enforce" byte comparable tokens? These questions might are possibly unrelated to this PR in particular and I should have brought up this point in the mega-thread.

@estebank
Copy link
Contributor

The PR LGTM, but I can foresee issues cropping up when involving macros.

@Manishearth
Copy link
Member

@estebank no, non-NFC chars appear for all kinds of reasons

@estebank
Copy link
Contributor

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Nov 26, 2019

⌛ Trying commit 0e3036b with merge ce0c4f6...

bors added a commit that referenced this pull request Nov 26, 2019
Normalize ident

Perform unicode normalization on identifiers. Resolving the first bullet point in #55467.
@bors
Copy link
Contributor

bors commented Nov 26, 2019

☀️ Try build successful - checks-azure
Build commit: ce0c4f6 (ce0c4f69329436f488883ad75290b9e38d921fed)

@rust-timer
Copy link
Collaborator

Queued ce0c4f6 with parent a44774c, future comparison URL.

@crlf0710
Copy link
Member Author

Mmm, is the timer task still queued?

@crlf0710
Copy link
Member Author

@bors try @rust-timer queue

@bors
Copy link
Contributor

bors commented Nov 29, 2019

@crlf0710: 🔑 Insufficient privileges: not in try users

@rust-timer
Copy link
Collaborator

Insufficient permissions to issue commands to rust-timer.

@Dylan-DPC-zz
Copy link

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Dec 11, 2019

🔒 Merge conflict

This pull request and the master branch diverged in a way that cannot be automatically merged. Please rebase on top of the latest master branch, and let the reviewer approve again.

How do I rebase?

Assuming self is your fork and upstream is this repository, you can resolve the conflict following these steps:

  1. git checkout normalize_ident (switch to your branch)
  2. git fetch upstream master (retrieve the latest master)
  3. git rebase upstream/master -p (rebase on top of it)
  4. Follow the on-screen instruction to resolve conflicts (check git status if you got lost).
  5. git push self normalize_ident --force-with-lease (update this PR)

You may also read Git Rebasing to Resolve Conflicts by Drew Blessing for a short tutorial.

Please avoid the "Resolve conflicts" button on GitHub. It uses git merge instead of git rebase which makes the PR commit history more difficult to read.

Sometimes step 4 will complete without asking for resolution. This is usually due to difference between how Cargo.lock conflict is handled during merge and rebase. This is normal, and you should still perform step 5 to update this PR.

Error message
Auto-merging src/librustc_parse/lexer/mod.rs
Auto-merging src/librustc_parse/Cargo.toml
CONFLICT (content): Merge conflict in src/librustc_parse/Cargo.toml
Auto-merging Cargo.lock
Automatic merge failed; fix conflicts and then commit the result.

@Mark-Simulacrum
Copy link
Member

I believe this caused the rollup failure here: #67435 (comment)

You can likely fix it by adding smallvec at version 1.0 (not 0.6!) to https://github.com/rust-lang/rust/blob/master/src/tools/rustc-workspace-hack/Cargo.toml#L65 and use Cargo's package renaming to make sure those two don't conflict as keys in the dependencies section.

@bors r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Dec 19, 2019
@crlf0710
Copy link
Member Author

crlf0710 commented Dec 26, 2019

Rebased and added dependency according to @Mark-Simulacrum 's instruction. @estebank

@Dylan-DPC-zz
Copy link

@bors r+

@bors
Copy link
Contributor

bors commented Dec 26, 2019

📌 Commit 27e7a1b has been approved by Dylan-DPC

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Dec 26, 2019
@bors
Copy link
Contributor

bors commented Dec 26, 2019

⌛ Testing commit 27e7a1b with merge acb6690...

bors added a commit that referenced this pull request Dec 26, 2019
Normalize ident

Perform unicode normalization on identifiers. Resolving the first bullet point in #55467.
@bors
Copy link
Contributor

bors commented Dec 26, 2019

☀️ Test successful - checks-azure
Approved by: Dylan-DPC
Pushing acb6690 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 26, 2019
@bors bors merged commit 27e7a1b into rust-lang:master Dec 26, 2019
@crlf0710 crlf0710 deleted the normalize_ident branch December 26, 2019 16:03
@petrochenkov
Copy link
Contributor

I have two questions / concerns:

  • Is it possible to normalize a non-ASCII identifier into an ASCII identifier?
  • Even without that non-ASCII identifiers are gated at the wrong point (AST after macro expansion) and not entirely correctly (visit_name instead of identifier tokens), rather than at the correct point (during lexing).
    Is there any desire to fix the gating (do it in lexer before the normalization) before the feature is stabilized?

@crlf0710
Copy link
Member Author

Following the RFC, since we're using NFC normalization, I think the answer to the first question is "no it's not possible". NFC normalization is basically reordering and regrouping, so nothing really disappears.

@crlf0710
Copy link
Member Author

For the second question, i think that's a question for a potential "shepherd" for RFC 2457, currently we don't have one, but it would be best if there is one. cc #55467

In my personal opinion though, i think it's not too wrong to gate at the AST after macro expansion, since macros and procedure macros can generate new identifiers, and need to be gated too. I basically know nothing about the (visit_name instead of identifier tokens) part.

@crlf0710
Copy link
Member Author

And by the way i believe there're some remaining work on performing this normalization to the user data sent to proc_macro crate. the RFC mentioned that we need to make sure:

Every API accepting raw identifiers (such as proc_macro::Ident::new normalizes them to NFC and APIs returning them as strings (like proc_macro::Ident::to_string) return the normalized form.

I think I need some mentor here. @petrochenkov could you give some instructions when you have time? I tried to read the code in proc_macro crate, but got total lost in the bridge-related thing. It's also concerning if we need to add the unicode-normalization dependency to the proc_macro crate.

@petrochenkov
Copy link
Contributor

@crlf0710
The "cannot be a raw identifier" error is an example of catching something at lexer level.

It's emitted twice in librustc_parse\lexer\mod.rs and in libsyntax_expand\proc_macro_server.rs and those are the only places where an identifier token can be created from user input.

So, the normalization (and also gating, arguably) needs to happen in the same places.
(It will catch anything produced by macros as well.)

@crlf0710
Copy link
Member Author

@petrochenkov do you mean https://github.com/rust-lang/rust/blob/master/src/libsyntax_expand/proc_macro_server.rs#L330 ?

I'm a little confused. After normalization, if the identifier changes, should i intern, get a new Symbol and use it in the result identifier? What should i do there with the existing Symbol, leaving it alone?

@petrochenkov
Copy link
Contributor

@crlf0710
Yes.

@crlf0710
Copy link
Member Author

Great, let me create a branch and PR it.

@petrochenkov
Copy link
Contributor

petrochenkov commented Dec 29, 2019

By the way, the RFC says the normalization is performed during parsing rather than lexing.
That means proc macros receive the original unnormalized tokens and can perform their own normalization (same as with escaping in literals).
But that's probably not the intent of the RFC.

On the other hand, if the normalization is performed during lexing, that would be the first case in which we do not preserve original tokens by design (except for a big hack with doc comments in macros).

Centril added a commit to Centril/rust that referenced this pull request Dec 31, 2019
…henkov

Add symbol normalization for proc_macro_server.

Follow up for rust-lang#66670, finishing the first bullet point in rust-lang#55467.

r? @petrochenkov
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
F-non_ascii_idents `#![feature(non_ascii_idents)]` merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.