Replaced panics by errors on invalid pages #188

Merged · 4 commits · Aug 20, 2022

Conversation

@evanrichter (Contributor) commented on Aug 17, 2022

Fuzzing was turning up errors again, so I am making best-guess fixes (one per commit).

Let me know if any of my fixes are the wrong idea and I can fix them up!

@codecov-commenter commented on Aug 17, 2022

Codecov Report

Merging #188 (e91f2b4) into main (a0e66c2) will decrease coverage by 0.09%.
The diff coverage is 42.85%.

@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
- Coverage   85.71%   85.62%   -0.10%     
==========================================
  Files          84       84              
  Lines        8227     8243      +16     
==========================================
+ Hits         7052     7058       +6     
- Misses       1175     1185      +10     
Impacted Files Coverage Δ
src/read/compression.rs 92.40% <27.27%> (-3.57%) ⬇️
src/read/page/reader.rs 86.31% <57.14%> (-1.26%) ⬇️
src/compression.rs 92.08% <66.66%> (-0.36%) ⬇️


@evanrichter (Contributor, Author) commented on Aug 18, 2022

Fuzzing seems pretty stable for now, but I will let it run overnight.

Edit: no further crashes found :)

@jorgecarleitao (Owner) left a comment


Looks great so far! I have one comment regarding the removal of try_reserve that I think we need to do something about.

@@ -200,13 +200,18 @@ pub(super) fn build_page<R: Read>(
let read_size: usize = page_header.compressed_page_size.try_into()?;

buffer.clear();
buffer.try_reserve(read_size)?;
@jorgecarleitao (Owner):

I think we need to introduce a new parameter to the reader that controls the maximum allowed (compressed) size of the page. Maybe we could re-use the parameter for the header and generalize it to the total page size (so, header + compressed). We would then check that read_size < max_size?

IMO this reserve is important, since the header gives us useful information that we would otherwise ignore.
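
For illustration, a minimal sketch of the guard being discussed here, assuming a hypothetical max_page_size parameter and an illustrative error type (neither is the crate's actual API):

    use std::collections::TryReserveError;

    // Illustrative error type for this sketch only.
    #[derive(Debug)]
    enum PageError {
        TooLarge { requested: usize, max: usize },
        Alloc(TryReserveError),
    }

    // Reserve space for a page only if its declared size is within the limit,
    // keeping the try_reserve so the header's size hint is not thrown away.
    fn reserve_page_buffer(
        buffer: &mut Vec<u8>,
        read_size: usize,
        max_page_size: usize,
    ) -> Result<(), PageError> {
        if read_size > max_page_size {
            return Err(PageError::TooLarge { requested: read_size, max: max_page_size });
        }
        buffer.clear();
        buffer.try_reserve(read_size).map_err(PageError::Alloc)?;
        Ok(())
    }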

@evanrichter (Contributor, Author):

Since it looks like buffer is a re-used Vec (is this right?), the allocation cost will amortize away after a couple of resizes. If we want to avoid the initial resizes and avoid adding a parameter, maybe we could do something like:

    // Cap the initial reservation; larger pages still work, the Vec just
    // grows past this as needed.
    const MAX_PREALLOC_SIZE: usize = 4 * 1024 * 1024;
    let prealloc_size = MAX_PREALLOC_SIZE.min(read_size);

    buffer.clear();
    buffer.try_reserve(prealloc_size)?;

The if bytes_read != read_size { ... } check is still necessary to catch out-of-spec page headers.

If a user has a file so large that it does not fit in memory, it would be expected that performance starts to degrade, right?

@jorgecarleitao (Owner):

Since it looks like buffer is a re-used Vec (is this right?), the allocation cost will amortize away after a couple of resizes.

Yes, but there are very large pages out there (from 1 MB to 1 GB, say). So not reserving can lead to multiple resizes of a potentially large number of bytes. In particular, read_to_end reads in chunks of 32 bytes ^1, so this statement will now be quite expensive (even the first time).

Another way of looking at it is: we are given a piece of information that is correct 99.9999% or so of the time (the compressed page length). Not using it in order to cater for the 0.0001% of cases (malicious files, etc.) is sub-optimal.

This is why I think we should consider a parameter, "max_page_size", and bail if we try to allocate more than that. I think this parameter should represent max_uncompressed_size, and we should use it both to limit the reading of the compressed page and to limit the number of bytes we decompress (to mitigate zip bombs). I think this is the parameter we are looking for to cater for both cases (over-allocation on read and zip bombs).

This also caters for the page header size, since that is decompressed by default.
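
For illustration only, a sketch of how a single limit could bound the compressed read; the function and parameter names (max_page_size, etc.) are assumptions, not the crate's API:

    use std::io::{Error, ErrorKind, Read, Result};

    // Read the declared compressed page into `buffer`, refusing pages whose
    // declared compressed or uncompressed size exceeds the single limit.
    fn read_page_bytes<R: Read>(
        reader: &mut R,
        compressed_size: usize,
        uncompressed_size: usize,
        max_page_size: usize,
        buffer: &mut Vec<u8>,
    ) -> Result<()> {
        if compressed_size > max_page_size || uncompressed_size > max_page_size {
            return Err(Error::new(
                ErrorKind::InvalidData,
                "page exceeds the configured maximum page size",
            ));
        }
        buffer.clear();
        // Reserve up front: the header gives us the size, so use it.
        buffer
            .try_reserve(compressed_size)
            .map_err(|_| Error::new(ErrorKind::OutOfMemory, "allocation failed"))?;
        // `take` caps the read at the declared size even if the stream is longer.
        let bytes_read = reader.take(compressed_size as u64).read_to_end(buffer)?;
        if bytes_read != compressed_size {
            return Err(Error::new(ErrorKind::UnexpectedEof, "page is truncated"));
        }
        Ok(())
    }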

@evanrichter (Contributor, Author) commented on Aug 19, 2022:

this parameter should represent max_uncompressed_size, and we should use it both to limit the reading of the compressed page and to limit the number of bytes we decompress

Yeah, that sounds good! The limit is implemented as a field on PageReader in the latest push. It actually replaces max_header_size because, if I understand correctly, any page-size limit should easily keep the header size in bounds as well.

This doesn't protect against zip bombs yet. Maybe adding the threshold as a field to CompressedDataPage and CompressedDictPage would be a good approach?
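
Roughly the shape being proposed, as a hypothetical sketch (the field and struct names here are illustrative; the crate's real PageReader, CompressedDataPage and CompressedDictPage have more fields than this):

    // Hypothetical sketch: the reader carries one limit and threads it into the
    // compressed pages it produces, so the decompression step can enforce the
    // same bound later (zip-bomb mitigation).
    struct PageReader<R> {
        reader: R,
        // Single knob: maximum allowed page size, compressed or uncompressed.
        max_page_size: usize,
    }

    struct CompressedDataPage {
        buffer: Vec<u8>,
        uncompressed_size: usize,
        // Carried along so decompression can refuse oversized output.
        max_page_size: usize,
    }

    impl<R> PageReader<R> {
        fn compressed_page(&self, buffer: Vec<u8>, uncompressed_size: usize) -> CompressedDataPage {
            CompressedDataPage {
                buffer,
                uncompressed_size,
                max_page_size: self.max_page_size,
            }
        }
    }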

@jorgecarleitao (Owner):

Exactly! :)

This doesn't protect against zip bombs yet. Maybe adding the threshold as a field to CompressedDataPage and CompressedDictPage would be a good approach?

Yes, my point is that, as an end user, you only need to think about one parameter instead of two (page_size, page_uncompressed) or three (page_header_size, page_size, page_uncompressed), so what we are doing here will not negatively impact the developer experience :)

Yes, we will need to introduce this parameter in the decompression path also ^^
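
For the decompression path, a hedged sketch of the same check (the decompress_fn callback and the names here are placeholders for whatever codec calls the crate actually uses):

    // Sketch only: enforce the same limit before allocating decompression output.
    fn decompress_page(
        compressed: &[u8],
        uncompressed_size: usize,
        max_page_size: usize,
        decompress_fn: impl Fn(&[u8], &mut [u8]) -> Result<usize, String>,
    ) -> Result<Vec<u8>, String> {
        if uncompressed_size > max_page_size {
            return Err(format!(
                "uncompressed page size ({uncompressed_size}) exceeds the maximum ({max_page_size})"
            ));
        }
        let mut out = Vec::new();
        out.try_reserve(uncompressed_size).map_err(|e| e.to_string())?;
        out.resize(uncompressed_size, 0);
        // The codec reports how many bytes it actually wrote.
        let written = decompress_fn(compressed, &mut out)?;
        out.truncate(written);
        Ok(out)
    }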

@evanrichter evanrichter marked this pull request as ready for review August 19, 2022 19:00
@jorgecarleitao jorgecarleitao changed the title from "fixes for fuzzing bugs" to "Replaced panics by errors on invalid pages" on Aug 20, 2022
@jorgecarleitao jorgecarleitao added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Aug 20, 2022
@jorgecarleitao jorgecarleitao merged commit d142a95 into jorgecarleitao:main Aug 20, 2022
Labels: enhancement (New feature or request)
Projects: None yet
3 participants