-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @veluca93 -- I looked at some of these changes -- some places definitely look like there could problems, though I am not quite sure
I wonder if you would be willing to break this PR into smaller independent PRs for review
for example,
- the changes to parquet might go in one PR
- Changes for list/map arrays in another
- changes for dictionary array in another
This would both be easier to review and let us merge the parts that get consensus while we work through the others
cc @XiangpengHao and @tustvold @jhorstmann for your comments
let len_bytes = buf.get(self.offset..self.offset + 4); | ||
let len = match len_bytes { | ||
None => { | ||
return Err(ParquetError::EOF("eof decoding byte array".into())); | ||
} | ||
Some(l) => u32::from_le_bytes(l.try_into().unwrap()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this any more sound? it seems to me this is simply does the same check via using slice
BTW I think you can write this kind of code more concisely like
let len_bytes = buf.get(self.offset..self.offset + 4); | |
let len = match len_bytes { | |
None => { | |
return Err(ParquetError::EOF("eof decoding byte array".into())); | |
} | |
Some(l) => u32::from_le_bytes(l.try_into().unwrap()), | |
let Some(len_bytes) = buf.get(self.offset..self.offset + 4) else { | |
return Err(ParquetError::EOF("eof decoding byte array".into())); | |
}; | |
let len = u32::from_le_bytes(len_bytes.try_into().unwrap()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was trying to avoid the version with let .. else
because I was not sure on the MSRV policies of arrow-rs :-)
As for being more sound: the original version is unsound if self.offset + 4
overflows (because it ends up calling .get_unchecked with an empty range, which is UB). In general, the new version achieves the same result without unsafe, which is IMO always a good idea (unless there are big performance issues).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think allocation sizes in rust are limited to isize::MAX
, so offset+4
should never overflow.
Personally, I find the previous version with the simple if
a bit easier to read. The unsafe
can probably be removed since the compiler should be able to remove the bounds checks on its own:
let len = u32::from_le_bytes(buf[offset..offset+4].try_into().unwrap())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is also fine IMO - I just wanted to remove the unsafe
, I attempted to do so without introducing new potential branches but if we believe the compiler to remove them (or they are not perf-critical) then this is indeed the clearest option.
|
||
let start_offset = self.offset + 4; | ||
let end_offset = start_offset + len as usize; | ||
if end_offset > buf.len() { | ||
if end_offset > buf.len() || end_offset < start_offset { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since offset is always increasing (it is formed by start offset + a usize
how could it ever be smaller than start offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It normally cannot, except in case of overflows :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we then rather use checked_add
in the previous line and handle and return an error on overflow? Or use wrapping_add
and keep this check, so it is clearer to the reader that overflow is being considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably checked_add
is the best choice (as it is more explicit). Commenting that self.offset
is always at most isize::MAX
, hence there is no overflow, would also be fine IMO.
} | ||
|
||
// SAFETY: bounds are checked above, block_id comes from a call to append_block, and | ||
// we are dealing with byte arrays with no alignment requirements. | ||
// FIXME: what about UTF8 arrays if validate_utf8 is false? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validate_utf8 is false only for ByteViewArray
(that can have arbitrary bytes) -- for StringViewArray. validate_utf8
is always true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW here are some more obsessive tests showing this is correct; #6023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this is up to the caller to ensure - i.e. the caller must ensure (i.e. when calling new
) that validate_utf8
is true if the data does not come from a StringViewArray. If that is correct, then this leads to potential UB if the caller does not uphold this contract, which means that new
should be an unsafe function and this property should ideally be documented as a safety invariant for the struct.
But I may have misunderstood something :-)
@@ -435,6 +435,10 @@ impl BitReader { | |||
/// This function panics if | |||
/// - `num_bits` is larger than the bit-capacity of `T` | |||
/// | |||
// FIXME: soundness issue - this method can be used to write arbitrary bytes to any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you provide an example where this leads to unsound behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safe code could implement, for example, FromBytes on Box. Then constructing a BitReader from a buffer of 0 Bytes, and calling get_batch on a slice of Box would end up creating a Box containing a null pointer, which is UB.
The issue here is that this code assumes (AFAIU) that all bit patterns are valid for T: FromBytes
, and this is true for the implementations of FromBytes here - but safe code could easily make it not true anymore. Thus, probably the best solution is to make FromBytes
an unsafe
trait.
@@ -461,6 +465,9 @@ impl BitReader { | |||
match size_of::<T>() { | |||
1 => { | |||
let ptr = batch.as_mut_ptr() as *mut u8; | |||
// SAFETY: batch is properly aligned and sized. Caller guarantees that T | |||
// can be safely seen as a slice of bytes through FromBytes bound | |||
// (FIXME: not actually true right now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ptr
is derived from &mut [T]
, so it is properly aligned and sized. Can you give an example to show that this is unsound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is indeed OK - see my previous comment for the issue here, which is not with constructing the slice but with writing to it.
Happy to split the PR up! |
Thank you @veluca93 -- I hope to find time tomorrow to review the new PRs |
This is a first step towards fixing #6020.