Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issue for UTF-16 decoding iterators #27830

Closed
SimonSapin opened this issue Aug 14, 2015 · 10 comments
Closed

Tracking issue for UTF-16 decoding iterators #27830

SimonSapin opened this issue Aug 14, 2015 · 10 comments
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@SimonSapin
Copy link
Contributor

#27808 proposes exposing in std::char two iterator adaptors Utf16Decoder and Utf16LossyDecoder. This functionality was previously only available with an API that require allocation (String::from_utf16{,_lossy}) or using the unstable rustc_unicode crate directly.

They are exposed unstable with a new utf16_decoder feature name. I’d like to stabilize them when we’re confident with the naming and API.

@alexcrichton alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Mar 8, 2016
@alexcrichton
Copy link
Member

🔔 This issue is now entering its cycle-long final comment period for stabilization in 1.9 🔔

@alexcrichton alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Mar 11, 2016
@alexcrichton
Copy link
Member

My personal opinion:

Ok, so the APIs that this is talking about are:

  • std::char::decode_utf16
  • std::char::DecodeUtf16
  • impl Iterator<Item=Result<char, u16>> for DecodeUtf16
  • std::char::REPLACEMENT_CHARACTER

I would not personally be a huge fan of stabilizing this cycle, some concerns being:

  • If we have decode_utf16, shouldn't we have decode_utf8? We have String::from_utf16{,_lossy} to mirror str::from_utf8{,_lossy}.
  • Do we want to support decode_utfXX_lossy? Are they easy enough to do with a .map(|e| e.unwrap_or(REPLACEMENT_CHARACTER))?
  • I don't think we have many other top-level functions in the standard library that take iterators and return iterators?
  • Having only one char constant in the module seems kinda odd, surely there are more constant-like things (like newline?) which we may wish to have?
  • The error type of the iterator may be a little unidiomatic, we probably want to wrap it up in something to provide extra contextual information in the future.

I think I'd be more comfortable with stabilizing given some utf-8 decoding functions as well, but it's probably worth also looking at the matrix of conversions we have:

A B A -> B B -> A
Iterator<u8> Iterator<char> not implemented `flat_map(
Iterator<u16> Iterator<char> char::decode_utf16 `flat_map(
&[u8] &str str::from_utf8 str::as_bytes
&[u16] String String::from_utf16 impossible
Iterator<u16> str impossible str::encode_utf16

I guess in that sense we'd be "complete" with decode_utf8? Is that all we need to put this issue to rest?

@aturon
Copy link
Member

aturon commented Apr 6, 2016

@alexcrichton

I don't think we have many other top-level functions in the standard library that take iterators and return iterators?

I don't see this as a concern, really. The more we work with iterators, the more natural things like "iterator transformers" such as this are. It's a fine ball to get rolling in my opinion.

The error type of the iterator may be a little unidiomatic, we probably want to wrap it up in something to provide extra contextual information in the future.

Agreed.

Having only one char constant in the module seems kinda odd, surely there are more constant-like things (like newline?) which we may wish to have?

Perhaps so -- but IMO this should influence organization more than anything. That is, we might want to think about a submodule for constants, if we do anticipate adding more over time.

Most of the other points have the flavor of: why stabilize just this one piece? I agree that I'd really like to have an overall vision here; I feel like every cycle we stabilize a couple of related methods. That said, your matrix is pretty useful, and does indeed suggest we should land both this and an analogous utf8 decoder.

@BurntSushi
Copy link
Member

I would be in favor of postponing until we have a more complete vision here as well (I'm a consumer of UTF-8 functions like this, which I typically just implement in-crate). There may be other routines we want to consider as well, for example, decoding a UTF-8 sequence in reverse can often be useful.

@SimonSapin
Copy link
Contributor Author

  • Do we want to support decode_utfXX_lossy? Are they easy enough to do with a .map(|e| e.unwrap_or(REPLACEMENT_CHARACTER))?

Since decode_utf16() has a dedicated return type this could also be decode_utf16().lossy(). IMO this is nicer than a function that would need to be imported separately. FWIW I had this in the PR originally and removed it at your suggestion :) #27808 (comment)

  • Having only one char constant in the module seems kinda odd, surely there are more constant-like things (like newline?) which we may wish to have?

We can remove the constant if lossy decoding is built-in.

I guess in that sense we'd be "complete" with decode_utf8? Is that all we need to put this issue to rest?

Yes, I do think we’re missing something lower-level than we currently have in std for decoding UTF-8. Something that does not allocate (could potentially be in libcore) and is more flexible than str::from_utf8 (which is strict (not lossy) and requires the entire input to be in a single slice).

But since &str and String use UTF-8 internally I don’t think decoding to Iterator<char> is ideal. We’d like to use slices of the input when possible.

I have some experiments at https://github.com/SimonSapin/rust-utf8. (In you’re interested, the commit history shows a number of different APIs I tried.) It supports "incremental" lossy decoding: input is a number of &[u8] chunks and a single code point may span multiple chunks. Output borrows input when possible and never allocates.

But this is significantly more API surface than, say, an iterator adaptor. And there’s probably a wide variety of use cases with slightly different constraints (@BurntSushi mentioned decoding in reverse), so I don’t know if it makes sense to try and support all of them in std.

Still, it’d be nice to have a single UTF-8 decoding primitive std that everything else could be based on. That way future performance improvements like #30740 (more may be possible with SIMD) would apply automatically, ideally. Though I don’t know yet what that primitive should be exactly, or if it’s even possible.

@BurntSushi
Copy link
Member

@SimonSapin Would fn decode_utf8(src: &[u8]) -> Result<(char, usize), ::std::str::Utf8Error> be enough? See, e.g., https://github.com/rust-lang-nursery/regex/blob/master/src/utf8.rs#L53-L64 and https://github.com/rust-lang-nursery/regex/blob/master/src/utf8.rs#L115-L117 for the reverse case. (They are both unoptimized and return an Option instead of a Result.)

@SimonSapin
Copy link
Contributor Author

@BurntSushi Building something that yields &str slices borrowing the input on top of this API is a fair amount of extra logic, even more so to support code points spanning chunks. And it would do more work than necessary (compute actual 32-bit values rather than just figure out which byte sequences are well-formed), I don’t know if the compile could optimize that away. And this API does not enable optimizations like #30740 that decode more than one code point at once.

@alexcrichton
Copy link
Member

The libs team discussed this during triage yesterday and the conclusion was to stabilize essentially everything as-is modulo changing the error returned by the iterator.

We felt that there's room for decoding an iterator of u8 to char, but we can always add that later.

alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 11, 2016
This commit applies all stabilizations, renamings, and deprecations that the
library team has decided on for the upcoming 1.9 release. All tracking issues
have gone through a cycle-long "final comment period" and the specific APIs
stabilized/deprecated are:

Stable

* `std::panic`
* `std::panic::catch_unwind` (renamed from `recover`)
* `std::panic::resume_unwind` (renamed from `propagate`)
* `std::panic::AssertUnwindSafe` (renamed from `AssertRecoverSafe`)
* `std::panic::UnwindSafe` (renamed from `RecoverSafe`)
* `str::is_char_boundary`
* `<*const T>::as_ref`
* `<*mut T>::as_ref`
* `<*mut T>::as_mut`
* `AsciiExt::make_ascii_uppercase`
* `AsciiExt::make_ascii_lowercase`
* `char::decode_utf16`
* `char::DecodeUtf16`
* `char::DecodeUtf16Error`
* `char::DecodeUtf16Error::unpaired_surrogate`
* `BTreeSet::take`
* `BTreeSet::replace`
* `BTreeSet::get`
* `HashSet::take`
* `HashSet::replace`
* `HashSet::get`
* `OsString::with_capacity`
* `OsString::clear`
* `OsString::capacity`
* `OsString::reserve`
* `OsString::reserve_exact`
* `OsStr::is_empty`
* `OsStr::len`
* `std::os::unix::thread`
* `RawPthread`
* `JoinHandleExt`
* `JoinHandleExt::as_pthread_t`
* `JoinHandleExt::into_pthread_t`
* `HashSet::hasher`
* `HashMap::hasher`
* `CommandExt::exec`
* `File::try_clone`
* `SocketAddr::set_ip`
* `SocketAddr::set_port`
* `SocketAddrV4::set_ip`
* `SocketAddrV4::set_port`
* `SocketAddrV6::set_ip`
* `SocketAddrV6::set_port`
* `SocketAddrV6::set_flowinfo`
* `SocketAddrV6::set_scope_id`
* `<[T]>::copy_from_slice`
* `ptr::read_volatile`
* `ptr::write_volatile`
* The `#[deprecated]` attribute
* `OpenOptions::create_new`

Deprecated

* `std::raw::Slice` - use raw parts of `slice` module instead
* `std::raw::Repr` - use raw parts of `slice` module instead
* `str::char_range_at` - use slicing plus `chars()` plus `len_utf8`
* `str::char_range_at_reverse` - use slicing plus `chars().rev()` plus `len_utf8`
* `str::char_at` - use slicing plus `chars()`
* `str::char_at_reverse` - use slicing plus `chars().rev()`
* `str::slice_shift_char` - use `chars()` plus `Chars::as_str`
* `CommandExt::session_leader` - use `before_exec` instead.

Closes rust-lang#27719
cc rust-lang#27751 (deprecating the `Slice` bits)
Closes rust-lang#27754
Closes rust-lang#27780
Closes rust-lang#27809
Closes rust-lang#27811
Closes rust-lang#27830
Closes rust-lang#28050
Closes rust-lang#29453
Closes rust-lang#29791
Closes rust-lang#29935
Closes rust-lang#30014
Closes rust-lang#30752
Closes rust-lang#31262
cc rust-lang#31398 (still need to deal with `before_exec`)
Closes rust-lang#31405
Closes rust-lang#31572
Closes rust-lang#31755
Closes rust-lang#31756
bors added a commit that referenced this issue Apr 12, 2016
std: Stabilize APIs for the 1.9 release

This commit applies all stabilizations, renamings, and deprecations that the
library team has decided on for the upcoming 1.9 release. All tracking issues
have gone through a cycle-long "final comment period" and the specific APIs
stabilized/deprecated are:

Stable

* `std::panic`
* `std::panic::catch_unwind` (renamed from `recover`)
* `std::panic::resume_unwind` (renamed from `propagate`)
* `std::panic::AssertUnwindSafe` (renamed from `AssertRecoverSafe`)
* `std::panic::UnwindSafe` (renamed from `RecoverSafe`)
* `str::is_char_boundary`
* `<*const T>::as_ref`
* `<*mut T>::as_ref`
* `<*mut T>::as_mut`
* `AsciiExt::make_ascii_uppercase`
* `AsciiExt::make_ascii_lowercase`
* `char::decode_utf16`
* `char::DecodeUtf16`
* `char::DecodeUtf16Error`
* `char::DecodeUtf16Error::unpaired_surrogate`
* `BTreeSet::take`
* `BTreeSet::replace`
* `BTreeSet::get`
* `HashSet::take`
* `HashSet::replace`
* `HashSet::get`
* `OsString::with_capacity`
* `OsString::clear`
* `OsString::capacity`
* `OsString::reserve`
* `OsString::reserve_exact`
* `OsStr::is_empty`
* `OsStr::len`
* `std::os::unix::thread`
* `RawPthread`
* `JoinHandleExt`
* `JoinHandleExt::as_pthread_t`
* `JoinHandleExt::into_pthread_t`
* `HashSet::hasher`
* `HashMap::hasher`
* `CommandExt::exec`
* `File::try_clone`
* `SocketAddr::set_ip`
* `SocketAddr::set_port`
* `SocketAddrV4::set_ip`
* `SocketAddrV4::set_port`
* `SocketAddrV6::set_ip`
* `SocketAddrV6::set_port`
* `SocketAddrV6::set_flowinfo`
* `SocketAddrV6::set_scope_id`
* `<[T]>::copy_from_slice`
* `ptr::read_volatile`
* `ptr::write_volatile`
* The `#[deprecated]` attribute
* `OpenOptions::create_new`

Deprecated

* `std::raw::Slice` - use raw parts of `slice` module instead
* `std::raw::Repr` - use raw parts of `slice` module instead
* `str::char_range_at` - use slicing plus `chars()` plus `len_utf8`
* `str::char_range_at_reverse` - use slicing plus `chars().rev()` plus `len_utf8`
* `str::char_at` - use slicing plus `chars()`
* `str::char_at_reverse` - use slicing plus `chars().rev()`
* `str::slice_shift_char` - use `chars()` plus `Chars::as_str`
* `CommandExt::session_leader` - use `before_exec` instead.

Closes #27719
cc #27751 (deprecating the `Slice` bits)
Closes #27754
Closes #27780
Closes #27809
Closes #27811
Closes #27830
Closes #28050
Closes #29453
Closes #29791
Closes #29935
Closes #30014
Closes #30752
Closes #31262
cc #31398 (still need to deal with `before_exec`)
Closes #31405
Closes #31572
Closes #31755
Closes #31756
alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 12, 2016
This commit applies all stabilizations, renamings, and deprecations that the
library team has decided on for the upcoming 1.9 release. All tracking issues
have gone through a cycle-long "final comment period" and the specific APIs
stabilized/deprecated are:

Stable

* `std::panic`
* `std::panic::catch_unwind` (renamed from `recover`)
* `std::panic::resume_unwind` (renamed from `propagate`)
* `std::panic::AssertUnwindSafe` (renamed from `AssertRecoverSafe`)
* `std::panic::UnwindSafe` (renamed from `RecoverSafe`)
* `str::is_char_boundary`
* `<*const T>::as_ref`
* `<*mut T>::as_ref`
* `<*mut T>::as_mut`
* `AsciiExt::make_ascii_uppercase`
* `AsciiExt::make_ascii_lowercase`
* `char::decode_utf16`
* `char::DecodeUtf16`
* `char::DecodeUtf16Error`
* `char::DecodeUtf16Error::unpaired_surrogate`
* `BTreeSet::take`
* `BTreeSet::replace`
* `BTreeSet::get`
* `HashSet::take`
* `HashSet::replace`
* `HashSet::get`
* `OsString::with_capacity`
* `OsString::clear`
* `OsString::capacity`
* `OsString::reserve`
* `OsString::reserve_exact`
* `OsStr::is_empty`
* `OsStr::len`
* `std::os::unix::thread`
* `RawPthread`
* `JoinHandleExt`
* `JoinHandleExt::as_pthread_t`
* `JoinHandleExt::into_pthread_t`
* `HashSet::hasher`
* `HashMap::hasher`
* `CommandExt::exec`
* `File::try_clone`
* `SocketAddr::set_ip`
* `SocketAddr::set_port`
* `SocketAddrV4::set_ip`
* `SocketAddrV4::set_port`
* `SocketAddrV6::set_ip`
* `SocketAddrV6::set_port`
* `SocketAddrV6::set_flowinfo`
* `SocketAddrV6::set_scope_id`
* `<[T]>::copy_from_slice`
* `ptr::read_volatile`
* `ptr::write_volatile`
* The `#[deprecated]` attribute
* `OpenOptions::create_new`

Deprecated

* `std::raw::Slice` - use raw parts of `slice` module instead
* `std::raw::Repr` - use raw parts of `slice` module instead
* `str::char_range_at` - use slicing plus `chars()` plus `len_utf8`
* `str::char_range_at_reverse` - use slicing plus `chars().rev()` plus `len_utf8`
* `str::char_at` - use slicing plus `chars()`
* `str::char_at_reverse` - use slicing plus `chars().rev()`
* `str::slice_shift_char` - use `chars()` plus `Chars::as_str`
* `CommandExt::session_leader` - use `before_exec` instead.

Closes rust-lang#27719
cc rust-lang#27751 (deprecating the `Slice` bits)
Closes rust-lang#27754
Closes rust-lang#27780
Closes rust-lang#27809
Closes rust-lang#27811
Closes rust-lang#27830
Closes rust-lang#28050
Closes rust-lang#29453
Closes rust-lang#29791
Closes rust-lang#29935
Closes rust-lang#30014
Closes rust-lang#30752
Closes rust-lang#31262
cc rust-lang#31398 (still need to deal with `before_exec`)
Closes rust-lang#31405
Closes rust-lang#31572
Closes rust-lang#31755
Closes rust-lang#31756
@lambda-fairy
Copy link
Contributor

Is there any reason that these items are defined in std_unicode and not core?

I was investigating #49319 and ended up here.

@SimonSapin
Copy link
Contributor Author

I’ll respond in #49319 since this thread has been closed for two years :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants