Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the trailing null from strings #7235

Closed
erickt opened this issue Jun 19, 2013 · 14 comments
Closed

Remove the trailing null from strings #7235

erickt opened this issue Jun 19, 2013 · 14 comments
Labels
A-FFI Area: Foreign function interface (FFI) A-Unicode Area: Unicode C-cleanup Category: PRs that clean code up or issues documenting cleanup.

Comments

@erickt
Copy link
Contributor

erickt commented Jun 19, 2013

It is unsafe to convert every &[u8] into a &str, because &strs need to have a valid byte after the end of the characters. This means that this pattern:

let v = &['a' as u8, 'b' as u8, 'c' as u8];
let s = str::from_bytes_slice(v);
str::as_c_str(s, |ptr| call_c_function(ptr));

is unsafe because str::as_c_str dereferences the next byte past the end of the slice. This is safe when working with a ~str as a &str, because it has the trailing null. To do this safely, you need to copy the bytes:

let v = &['a' as u8, 'b' as u8, 'c' as u8];
let s = str::from_bytes(v);
str::as_c_str(s, |ptr| call_c_function(ptr));

While we could work around this by marking from_bytes_slice as unsafe and documenting this unsafe behavior, it would be simplest to just remove the trailing null from strings and allow the end user to add the trailing null if they want it. .as_c_str() can still take the fast path if the string is null terminated, or allocate a temporary string otherwise.

Some comments from the IRC discussion:

@graydon suggests renaming .as_c_str() to .as_null_terminated_c_str()

@cmr suggested .as_null_terminated_str()

@bstrie suggested .null_terminate()

@Kimundi mentioned a .as_c_str() that takes a &mut ~str, adds a NULL, calls the closure, then removes the NULL before exiting.

cc: #6869

@graydon
Copy link
Contributor

graydon commented Jun 19, 2013

nominating for backwards compatibility

@lilyball
Copy link
Contributor

CStrings, by definition, are null-terminated, so I don't see any reason to change the name away from .as_c_str().

@emberian
Copy link
Member

@kballard because rust strings, as valid utf-8, can contain NUL. Thus, not all rust strings from as_c_str are the C string you would expect. Naming it away from as_c_str emphasizes this.

@thestinger
Copy link
Contributor

I don't think we should have a shortcut for temporarily copying to a nul-terminated string at all. Anything using \0 as a string terminator doesn't support UTF-8, because they are valid characters in a Unicode string.

The most common usage of nul-termination is likely to be paths, but those can't be represented with our string type because they aren't always valid Unicode so it's not something the string type can deal with.

In special cases where you really do want to abandon all hope of correctness, you can just push \0 to the end yourself or make a copy and push it. This is potentially a security issue, because string comparisons or other operations stopping at the \0 will be incorrect. Is there actually a use case for handling only a subset of possible strings correctly?

@lilyball
Copy link
Contributor

@cmr: Naming it something like .as_null_terminated_c_str() doesn't change the fact that it could contain NUL.

@Kimundi
Copy link
Member

Kimundi commented Jun 19, 2013

I wrote this under issue #6869, but seeing how there is more discussion here I'm moving it here:

Summary of the current situation

  • Every string allocation in rust is null terminated (~str, &'static str etc).
  • Just like a &[u8], a &str is a tuple (*u8, len) that points at the start of a memory range of len bytes.
  • However, while a &str points at len bytes, only len-1 bytes of that are considered part of the utf-8 data.
  • The last valid address a &str points at is the 'hidden' byte.
  • If you take a slice (start, end) of a string, you actually get a &[u8] slice of (start, end+1). This is safe because there is always a byte after the end for a string allocation.
  • For example, if you take the slice of a whole allocated string, the hidden byte points at the null terminator.
  • And if you take a subslice of a that string, the hidden byte points at whatever byte of the string allocation comes next after the end you specified.

All this is done just for one reason: So that as_c_str() can take a &str, safely look at its hidden byte, and only if it's not NULL make a copy of the string for increased safety (The c side will find a null terminator in valid memory). This design has a few issues though:

  • Converting between &str and &[u8] is a pain. While &str -> &[u8] is perfectly fine, A &[u8] of size N can only be turned into a &str of length N-1, making a lot of low level conversions a pita/impossible without some sort of copy.
  • The difference is hard to explain and library devs stumble over it all the time.
  • Because as_c_str() only does a copy 'sometimes', it's hard to gauge its performance.
  • Valid utf8 can contain interior nulls, so even with a guaranteed null terminator you might have to do additional transformations on your string to get the behavior you want of as_c_str()

Making &str the same as &[u8]

Because of this issues, &strs one-off-ness really introduces more problem than it solves, and should be changed. This means changing all functions in the library and compiler that handle a &str to not tread it as (*u8, len) with a len-1 long utf8 part and a hidden byte, but instead just as a len memory slice that points at valid utf8.

Changes to the C interface

The current as_c_str() should be replaced by a few specialized functions depending on your use case:

  • (&str).as_terminated_c_str() should assert that the last codepoint in it is '\x00', and only then provide a *u8 and len-1 pair to the string data. The docs also need to mention that the data might contain interior nulls, so the C side might see a shorter string than intended.
  • (&str).as_unterminated_c_str() should provide a *u8 and len pair to the string data. The docs need to be explicit that there is NO null termination guaranteed, but that the string might contain interior nulls, including at its end.
  • (&mut ~str).null_terminate() can be a convenience method to push a '\x00' at the end if it's not there already.
  • (&str).is_null_terminated() returns true if the last byte is null.
  • (&str).is_c_null_term_compatible() returns true if the last byte is null and all other aren't.
  • (&str).as_terminated_c_str_auto_copy() should take a &str, see if it ends with a '\x00', and if not make a copy of it, push a null, and then call as_terminated_c_str() on it (Like today's as_c_str()). For short copies it should include a smallvector buffer or similar anti-heap-allocation strategies.
  • (~str).into_terminated_c_str() -> ~str should take ownership of a ~str, push a null, call as_terminated_c_str on it, pop the null and return the ~str again, leveraging overallocation to prevent a copy.

@thestinger
Copy link
Contributor

I really don't think it deserves even one method because it's completely incorrect to treat UTF-8 this way. We aren't using modified UTF-8, so we have to accept that it's not usable with C string APIs.

Can someone point out a situation or use case where this would actually be correct/useful?

@Kimundi
Copy link
Member

Kimundi commented Jun 19, 2013

@thestinger good point, maybe the whole api should be based around &[Ascii] or &[u8].
However, most of the issues I listed above still apply at the boundaries where you convert to/from this type.

@lilyball
Copy link
Contributor

While UTF-8 allows NUL, most UTF-8 strings won't have interior NULs. And the fact of the matter is, it's awfully convenient to be able to construct a NUL-terminated C String when working with FFI functions. It would be nice not to have to copy the string to a ~[u8] every time you want to work with an FFI function.

For reference, in Cocoa, NSString is a fully unicode-aware string class, which provides a convenience method -UTF8String which returns a NUL-terminated char*, and there's been no issues with this.

@lilyball
Copy link
Contributor

Perhaps we should just implement .as_c_str() on ~str alone, and require ~str to be NUL-terminated, but not provide this functionality at all on &str. Hopefully people will be using as_buff() anyway (which should be turned into a method).

@thestinger
Copy link
Contributor

@kballard: so what's an example FFI API, where handling only a subset of strings correctly is still going to be okay? This kind of flaw can easily become a security vulnerability and I don't see why every string should be a byte longer just to make it easier to write broken code.

Encryption, compression, I/O and file paths are all byte based so the string representation having a \0 at the end isn't helpful/relevant because the bindings will use [u8].

Nothing stops you from pushing a \0 to the end of a ~str if you're determined to write code that only supports a subset of Unicode.

@lilyball
Copy link
Contributor

My general feeling here is that, once you'e dealing with FFI, you're already in unsafe land, so instead of trying to be opinionated about what is safe and what is not, we should just provide the tools to do what's useful.

That said, the argument that making every string a byte longer when most of them are never going to touch FFI code (not that this is precisely what you said) is a good one. Although I wonder how much of an impact this will have in practice.

@thestinger
Copy link
Contributor

I don't think it's useful though. I can't think of any case where it would be okay/useful to use strings for stuff that's not actually Unicode. I was also against having any ASCII-only methods (like the now removed to_lower and to_upper) on strings for the same reason - it should be separate.

If we ever make a small string class, the impact will be that strings under 10 bytes are on the stack instead of strings under 11 bytes on 32-bit, and 22 instead of 23 on 64-bit. It can also push them into the next allocator size class.

@graydon
Copy link
Contributor

graydon commented Jun 19, 2013

@thestinger suggests that we look at haskell's Foreign.C.String :: CString type (http://www.haskell.org/ghc/docs/latest/html/libraries/base/Foreign-C-String.html) and structure the library such that &str can be converted to such a CString which will ensure that it matches current C-string assumptions:

  • Encoded by current locale encoding
  • No interior nulls
  • One trailing null

IMO this relates closely to having to figure out APIs for handling non-utf8 encodings. See #6164, #4837, etc. (there should be a metabug for overhauling unicode / absorbing libICU / something; but there isn't yet)

@bors bors closed this as completed in 60f5011 Aug 10, 2013
@erickt erickt removed their assignment Jun 16, 2014
flip1995 added a commit to flip1995/rust that referenced this issue May 20, 2021
…=flip1995

Fix another manual_unwrap_or deref FP

changelog: none (since this just piggybacks on rust-lang#7233)

Fixes rust-lang#6960
flip1995 pushed a commit to flip1995/rust that referenced this issue May 20, 2021
Rollup of 3 pull requests

Successful merges:

 - rust-lang#7235 (Fix another manual_unwrap_or deref FP)
 - rust-lang#7237 (Add the command to add upstream remote)
 - rust-lang#7239 (CI: update rustup before installing the toolchain on windows)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup

changelog: rollup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-FFI Area: Foreign function interface (FFI) A-Unicode Area: Unicode C-cleanup Category: PRs that clean code up or issues documenting cleanup.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants