Add an "ascii character" type to reduce `unsafe` needs in conversions #179
I like the idea. Could it be added as a private API and used as an implementation detail for now? That would be a perfect demonstration of the usefulness of this type. One consideration: if we add an "ascii character" type, why not also an "ascii string" type? Or would an "ascii string" just be …
It doesn't fit great as a … There'd be no extra invariants for an …
It seems you're building on pattern types (rust-lang/rust#107606)?
@safinaskar I wrote it that way as the simplest way to phrase it in the summary. The field is private, so there are many possible implementations -- including the basic works-on-stable approach of just making it (be or wrap) a big enum -- and this could thus move forward immediately, without needing to be blocked on speculative features like pattern types.
One other thing to add: …
You can wrap a 128-variant enum in …
Worth mentioning this is exactly what the …
I'll second this. What I like about this is that it feels like a pretty small addition (a single type), but it provides a lot of additional expressive and safe power. I've run into these kinds of situations in various places, and it is in part (small part) one of the nice things about …

Also, while not part of the proposal, I saw a mention of separate …

One other thing that I think is worth bringing up is the …
Other reproductions of an ASCII enum: icu4x, the Unicode library, has a …

Personally, I lean towards using an enum rather than a range-restricted wrapper like:

```rust
#[rustc_layout_scalar_valid_range_start(0)]
#[rustc_layout_scalar_valid_range_end(128)]
#[repr(transparent)]
pub struct AsciiChar(u8);
```

The wrapper is a less flexible design. The names and properties of ASCII characters will never change and there aren't many of them, so we might as well expose them as enum variants that can participate in pattern matching.
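For concreteness, here is a rough stable-Rust emulation of that wrapper design (the `rustc_layout_scalar_valid_range_*` attributes are internal to the standard library and can't be used in ordinary crates). The type and method names here are hypothetical, and unlike the proposed enum, this form gives callers no named variants to pattern-match on:

```rust
// Hypothetical stable emulation of a range-restricted ASCII wrapper.
// Unlike a 128-variant enum, callers can't match on named variants;
// the invariant (value < 128) is enforced only by the constructor.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct AsciiChar(u8);

impl AsciiChar {
    /// Returns `Some` only for bytes in the ASCII range 0..=127.
    pub const fn new(b: u8) -> Option<Self> {
        if b < 128 { Some(AsciiChar(b)) } else { None }
    }

    /// Extracts the underlying byte.
    pub const fn to_u8(self) -> u8 {
        self.0
    }
}

fn main() {
    assert_eq!(AsciiChar::new(b'A').map(AsciiChar::to_u8), Some(65));
    assert!(AsciiChar::new(200).is_none());
}
```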
It's not.
PR open: rust-lang/rust#111009
Add `ascii::Char` (ACP#179) ACP second: rust-lang/libs-team#179 (comment) New tracking issue: rust-lang#110998 For now this is an `enum` as `@kupiakos` [suggested](rust-lang/libs-team#179 (comment)), with the variants under a different feature flag. There's lots more things that could be added here, and place for further doc updates, but this seems like a plausible starting point PR. I've gone through and put an `as_ascii` next to every `is_ascii`: on `u8`, `char`, `[u8]`, and `str`. As a demonstration, made a commit updating some formatting code to use this: scottmcm@ascii-char-in-fmt (I don't want to include that in this PR, though, because that brings in perf questions that don't exist if this is just adding new unstable APIs.)
This landed on nightly (https://doc.rust-lang.org/nightly/std/ascii/enum.Char.html), so I'll close this as accepted 🎉
@rustbot labels +ACP-accepted
Since core::ascii::Char::digit returns Some for decimal digits only, introduce core::ascii::Char::from_digit method which handles values up to 35. It mimics char::from_digit though it forgoes the radix parameter. Issue: rust-lang/libs-team#179 Issue: rust-lang#110998
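To make the described behavior concrete, here is a hedged sketch of a `from_digit` that handles values up to 35, mirroring `char::from_digit` with an implied radix of 36. The free-function form, its name, and the assumption that letters come out lowercase (as with `char::from_digit`) are all illustrative, not the actual API:

```rust
// Hypothetical sketch of the described from_digit semantics:
// 0..=9 map to '0'..='9', 10..=35 map to 'a'..='z', anything larger is None.
fn ascii_from_digit(d: u32) -> Option<char> {
    match d {
        0..=9 => Some((b'0' + d as u8) as char),
        10..=35 => Some((b'a' + (d as u8 - 10)) as char),
        _ => None,
    }
}

fn main() {
    assert_eq!(ascii_from_digit(7), Some('7'));
    assert_eq!(ascii_from_digit(35), Some('z'));
    assert_eq!(ascii_from_digit(36), None);
}
```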
Proposal
Problem statement
For individual bytes, the backend can plausibly be expected to optimize things when it knows that a `char` came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on `unsafe` blocks, because the safe equivalents have a material performance impact.

We should offer a nice way for people to write such "the problem fundamentally only produces ASCII `String`s" code without needing `unsafe` and without spurious UTF-8 validation checks. After all, to quote `std::ascii`, …

Motivation, use-cases
I was reminded about this by this comment in rust-lang/rust#105076:
But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an exemplar of problems that fundamentally only produce ASCII. But the code is doing a `String::from_utf8(outdata).unwrap()` at the end because the other options aren't great.

One might say "oh, just build a `String` as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a `Vec<u8>` generates substantially less complicated code than pushing one to a `String` (https://rust.godbolt.org/z/xMYxj5WYr), since a `0..=255` USV might still take 2 bytes in UTF-8. That problem could be worked around with a `BString` instead, going outside the standard library, but that's not a fix for the whole thing, because then there's still a check needed later to get back to a `&str` or `String`.

There should be a `core` type for an individual ASCII character, so that having proven to LLVM at the individual-character level that things are in range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.

[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like `.filter_map(AsciiChar::new)`.

[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos
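The `.filter_map(AsciiChar::new)` idea can be sketched on stable Rust with a hypothetical `AsciiChar` newtype standing in for the proposed type (the real nightly type is `std::ascii::Char`):

```rust
// Hypothetical stand-in for the proposed ASCII character type.
#[derive(Clone, Copy)]
struct AsciiChar(u8);

impl AsciiChar {
    /// Returns `Some` only for bytes in the ASCII range.
    fn new(b: u8) -> Option<Self> {
        (b < 128).then_some(AsciiChar(b))
    }

    /// Every ASCII byte is a valid `char`, so this is infallible.
    fn as_char(self) -> char {
        self.0 as char
    }
}

fn main() {
    // Keep just the ASCII characters from a mixed string: the non-ASCII
    // bytes of 'é' and 'ö' are filtered out by `filter_map`.
    let s = "héllo, wörld";
    let ascii: String = s
        .bytes()
        .filter_map(AsciiChar::new)
        .map(AsciiChar::as_char)
        .collect();
    assert_eq!(ascii, "hllo, wrld");
}
```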
Solution sketches
In `core::ascii`:

…

In `alloc::string`:

…

^ this `From` being the main idea of the whole thing.

Safe code can `Char::new(…).unwrap()`, since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq), or can do it in `const`s, then use the non-reallocating infallible `From`s later if it needs `String`s or `&str`s.
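The proposal's code blocks did not survive extraction, so as a rough stable-Rust illustration of the intended shape (all names hypothetical; the point is that the one `unsafe` lives in a library function whose argument type proves the invariant):

```rust
// Hypothetical stand-in for the proposed core::ascii::Char; the real
// proposal makes it an enum with 128 variants.
#[derive(Clone, Copy)]
pub struct AsciiChar(u8);

impl AsciiChar {
    pub fn new(b: u8) -> Option<Self> {
        (b < 128).then_some(AsciiChar(b))
    }
}

// Sketch of the infallible conversion that is "the main idea": every
// ASCII byte is valid UTF-8, so no validation pass is needed. (The real
// `From` impl could also reuse the allocation instead of collecting.)
fn string_from_ascii(chars: Vec<AsciiChar>) -> String {
    let bytes: Vec<u8> = chars.into_iter().map(|c| c.0).collect();
    // SAFETY: every byte is < 128 by AsciiChar's invariant, hence valid UTF-8.
    unsafe { String::from_utf8_unchecked(bytes) }
}

fn main() {
    let chars: Vec<AsciiChar> = b"hello".iter().copied().filter_map(AsciiChar::new).collect();
    assert_eq!(string_from_ascii(chars), "hello");
}
```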
Other possibilities
I wouldn't put any of these in an initial PR, but as related things:

- Make it `repr(u8)`. That would allow `as`-casting it, for better or worse.
- Named constants for `ACK` and `DEL` and such.
- `AsRef<str>`s are possible, like on `ascii::Char` itself or arrays/vectors thereof.
- `AsRef<[u8]>`s too.
- `String::push_ascii`
- `String: Extend<ascii::Char>` or `String: FromIterator<ascii::Char>`
- Conversions from `&str` (or `&[u8]`) back to `&[ascii::Char]`.
- A `[u8; N] -> [ascii::Char; N]` operation it can use in a `const`, so it can have something like ``const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();``. Long-term that's `array::map` and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.

And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like `ascii::AsciiChar`, despite the stuttering name, to avoid `Char` vs `char` confusion?

What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.