Replaced panics by errors on invalid pages #188
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
- Coverage   85.71%   85.62%   -0.10%
==========================================
  Files          84       84
  Lines        8227     8243      +16
==========================================
+ Hits         7052     7058       +6
- Misses       1175     1185      +10
Fuzzing seems pretty stable for now, but I will let it run overnight. Edit: no further crashes found :)
Looks great so far! I have one comment regarding the removal of try_reserve that I think we need to do something about.
@@ -200,13 +200,18 @@ pub(super) fn build_page<R: Read>(
    let read_size: usize = page_header.compressed_page_size.try_into()?;

    buffer.clear();
    buffer.try_reserve(read_size)?;
I think we need to introduce a new parameter to the reader that controls the maximum allowed (compressed) size of the page. Maybe we could re-use the parameter for the header and generalize it to the total page size (so, header + compressed). We would then check that read_size < max_size.
IMO this reserve is important, as we have useful information that we would otherwise ignore.
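For concreteness, here is a rough sketch of the check I have in mind; the max_page_size parameter name and the error handling below are placeholders, not the crate's actual API:

```rust
use std::io::Read;

// Sketch only: bound the size advertised in the page header by a hypothetical
// `max_page_size` before reserving, so a corrupt or malicious header cannot
// trigger a huge allocation, while a valid header still gets a single reserve.
fn read_page_body<R: Read>(
    reader: &mut R,
    read_size: usize,
    max_page_size: usize,
    buffer: &mut Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    if read_size > max_page_size {
        return Err(format!("page size {} exceeds the limit {}", read_size, max_page_size).into());
    }
    buffer.clear();
    buffer.try_reserve(read_size)?;
    let bytes_read = reader.take(read_size as u64).read_to_end(buffer)?;
    if bytes_read != read_size {
        return Err("page is shorter than advertised in its header".into());
    }
    Ok(())
}
```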
Since it looks like buffer is a re-used Vec (is this right?), the allocation cost will be amortized away after a couple of resizes. If we want to avoid the initial resizes and avoid adding a parameter, maybe we could do something like:
const MAX_PREALLOC_SIZE: usize = 4 * 1024 * 1024;
let prealloc_size = MAX_PREALLOC_SIZE.min(read_size);
buffer.clear();
buffer.try_reserve(prealloc_size)?;
The check if bytes_read != read_size { ... } is still necessary to catch out-of-spec page headers.
If a user has a file that is larger than their memory, it would be expected that performance starts to degrade anyway, right?
Since it looks like buffer is a re-used Vec (is this right?), the allocation cost will be amortized away after a couple of resizes.

Yes, but there are very large pages out there (from, say, 1 MB to 1 GB), so not reserving can lead to multiple resizes of a potentially large number of bytes. In particular, read_to_end reads in chunks of 32 bytes ^1, so this statement will now be quite expensive (even the first time).
Another way of looking at it: we are given a piece of information that is correct 99.9999% or so of the time (the compressed page length). Not using it in order to cater for the 0.0001% of cases (malicious files, etc.) is sub-optimal.
This is why I think we should consider a parameter, "max_page_size", and bail if we try to allocate more than that. I think this parameter should represent max_uncompressed_size, and we should use it both to limit the reading of the compressed page and to limit the number of bytes we decompress (to mitigate zip bombs). I think this is the parameter we are looking for to cater for both cases (over-allocation on read and zip bombs).
This also caters for the page header size, since that is decompressed by default.
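To make that concrete, here is a sketch of how a single limit could gate both sides; the names are hypothetical and decompress_into is only a stub standing in for the codec-specific call:

```rust
use std::io::Read;

// Stub standing in for the codec-specific call (snappy, gzip, ...).
fn decompress_into(src: &[u8], dst: &mut Vec<u8>) -> Result<(), Box<dyn std::error::Error>> {
    dst.extend_from_slice(src);
    Ok(())
}

// Sketch only: one `max_uncompressed_size` bounds both the compressed read
// (over-allocation) and the decompressed output (zip bombs).
fn read_and_decompress_page<R: Read>(
    reader: &mut R,
    compressed_size: usize,
    uncompressed_size: usize,
    max_uncompressed_size: usize,
    compressed: &mut Vec<u8>,
    decompressed: &mut Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    if compressed_size > max_uncompressed_size || uncompressed_size > max_uncompressed_size {
        return Err("page exceeds the configured maximum uncompressed size".into());
    }
    // Limit the read of the compressed bytes.
    compressed.clear();
    compressed.try_reserve(compressed_size)?;
    reader.take(compressed_size as u64).read_to_end(compressed)?;

    // Limit how much we are willing to materialize when decompressing.
    decompressed.clear();
    decompressed.try_reserve(uncompressed_size)?;
    decompress_into(compressed.as_slice(), decompressed)?;
    Ok(())
}
```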
I think this parameter should represent max_uncompressed_size, and we should use it both to limit the reading of the compressed page and to limit the number of bytes we decompress
Yeah, that sounds good! The limit is implemented as a field on PageReader in the latest push. It actually replaces max_header_size because, if I understand correctly, any page size limit should easily keep a header size in bounds as well.
This doesn't protect against zip bombs yet. Maybe adding the threshold as a field to CompressedDataPage and CompressedDictPage would be a good approach?
Exactly! :)
This doesn't protect against zip bombs yet. Maybe adding the threshold as a field to CompressedDataPage and CompressedDictPage would be a good approach?
Yes, my point is that, as an end user, you only need to think about one parameter instead of 2 (page_size, page_uncompressed) or 3 (page_header_size, page_size, page_uncompressed), so what we are doing here will not negatively impact developer experience :)
Yes, we will need to introduce this parameter in the decompression path also ^^
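For illustration, a sketch of carrying that threshold with the page so the decompression path can enforce it; the struct and field names here are illustrative, not the crate's actual definitions:

```rust
// Sketch only: the reader's limit travels with the compressed page so the
// decompression path can check it before allocating the decompressed buffer.
struct CompressedDataPageSketch {
    compressed: Vec<u8>,
    uncompressed_page_size: usize,
    max_uncompressed_size: usize,
}

fn decompress_sketch(
    page: &CompressedDataPageSketch,
    out: &mut Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    // Reject out-of-spec sizes before reserving, mitigating zip bombs.
    if page.uncompressed_page_size > page.max_uncompressed_size {
        return Err("uncompressed page size exceeds the configured limit".into());
    }
    out.clear();
    out.try_reserve(page.uncompressed_page_size)?;
    // ...codec-specific decompression of `page.compressed` into `out` would go here...
    Ok(())
}
```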
e91f2b4 to e801e1e
Fuzzing was turning up errors again, so I am making best-guess fixes (one per commit). Let me know if any of my fixes are the wrong idea and I can fix them up!