-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reader<R: BufRead>
unnecessarily panics on 32-bit targets when the document size exceeds 4 GiB
#751
Comments
Seems reasonable. Mind to crate a PR? I plan to release 0.32 next few days so this change can be included there. |
@Mingun I'll give it a try! :) |
While trying to approach this issue, I noticed a much bigger issue: all internal offsets are computed using Here's a very short reproduction of the issue, beginning with fn main() {
let file = std::fs::File::open("large.xml").unwrap();
let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
let mut buf = Vec::new();
loop {
match reader.read_event_into(&mut buf).unwrap() {
quick_xml::events::Event::Eof => break,
_ => (),
}
}
println!("Done.");
} You can create such a file (echo -n '<root>' ; perl -e 'print "<child>text</child>"x300000000' ; echo -n '</root>') > large.xml Then run the code in 32-bit mode: $ rustup target add i686-unknown-linux-gnu
$ cargo run --release --target i686-unknown-linux-gnu
Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
Finished `release` profile [optimized] target(s) in 0.27s
Running `target/i686-unknown-linux-gnu/release/test-quick-xml-4gb`
thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace As you can see, and as expected from the internal use of Since |
Reader<R: BufRead>
unnecessarily panics on 32-bit targets when the document size exceeds 4 GiB
Looking at the same thing right now.
Unless I missed something, it looks like we could get away with keeping everything memory related in |
Yes, internal offsets should be stored in I did a labor work of prototyping in u64-offsets branch. |
If I'm understanding the question correctly, then the answer is no. A |
@Mingun cherry pick fixes for tests from pronebird@a393ffb |
@pronebird @Mingun I apologize, my previous example did not clear the buffer in-between events properly. This is the correct reproduction: fn main() {
let file = std::fs::File::open("large.xml").unwrap();
let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
let mut buf = Vec::new();
loop {
match reader.read_event_into(&mut buf).unwrap() {
quick_xml::events::Event::Eof => break,
_ => (),
}
buf.clear(); // This was missing above.
}
println!("Done.");
} Now we see the expected panic: $ cargo run --target i686-unknown-linux-gnu
Compiling cfg-if v1.0.0
Compiling memchr v2.7.4
Compiling encoding_rs v0.8.34
Compiling quick-xml v0.33.0
Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.13s
Running `target/i686-unknown-linux-gnu/debug/test-quick-xml-4gb`
thread 'main' panicked at /home/tniessen/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quick-xml-0.33.0/src/reader/buffered_reader.rs:286:5:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace |
@tniessen, could you test that with u64-offsets branch? |
@Mingun Neither debug nor release mode panic with that branch (using the minimal repro from my last comment). |
Cool! I would be good to finish that work by thinking how we can test it. Of course, having 4GB+ file in the repository is not an option 😄. We could create a special |
You probably mean exceeds Graceful solution would be to use checked arithmetics all over the place, catch overflows and return an |
I made a some checks and here is the results:
"small text" is the <child>text</child> "long text" is the Lorem Ipsum with 100 paragraphs generated by https://ru.lipsum.com/feed/html, 61475 bytes long in the following frame: <content description="Text generated by https://ru.lipsum.com/feed/html: 100 paragraphs">
Lorem Ipsum...
</content> "With checks" is the The source of XML: struct Fountain<'a> {
/// That piece of data repeated infinitely...
chunk: &'a [u8],
/// Part of `chunk` that was consumed by BufRead impl
consumed: usize,
/// The overall count of read bytes
overall_read: u64,
}
impl<'a> io::Read for Fountain<'a> {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
let available = &self.chunk[self.consumed..];
let len = buf.len().min(available.len());
let (portion, _) = available.split_at(len);
buf.copy_from_slice(portion);
Ok(len)
}
}
impl<'a> io::BufRead for Fountain<'a> {
fn fill_buf(&mut self) -> io::Result<&[u8]> {
Ok(&self.chunk[self.consumed..])
}
fn consume(&mut self, amt: usize) {
self.consumed += amt;
if self.consumed == self.chunk.len() {
self.consumed = 0;
}
self.overall_read += amt as u64;
}
} Test is ended when Because I ran tests on x86_64, I did not run into the problems, so the only question remains how to test that on the correct target to hit the error. I think that should be possible to ensure that we have such target on CI. |
Thank you @Mingun and @pronebird for your work on this! |
I also just released 0.34.0 with this fix. |
quick_xml::reader::Reader<R: BufRead>
allows reading arbitrarily large files without allocating much memory. I am using this in combination withflate2::read::GzDecoder
in order to decompress and simultaneously parse very large XML files (several gigabytes per document) without having to store the decompressed file (either in memory or on disk).Unfortunately, because offsets within the document are computed as
usize
internally, this causesquick-xml
to panic when the document size exceeds 4 GiB on 32-bit targets.Notably,
buffer_position()
returnsusize
, which is not necessarily sufficient for representing the correct position if the input document size exceeds 4 GiB.Rust's built-in
std::io
types typically useu64
for file sizes (andi64
for relative offsets within files). Hence, I believe it would be appropriate to useu64
for internal computations (and to return it frombuffer_position()
etc.) as well.The text was updated successfully, but these errors were encountered: