[C++] ReadNext in arrow::RecordBatchReader returns invalid status on second or subsequent items #41339
Comments
Would you mind uploading the parquet file? I cannot debug quickly without the file...
Here it is, gzipped because you can't upload a parquet file raw.
Aha, I used the master code and ran:

```cpp
#include <iostream>
#include <string>

#include "arrow/io/api.h"
#include "arrow/record_batch.h"
#include "parquet/arrow/reader.h"
#include "parquet/properties.h"

arrow::Status ReadInBatches(std::string path_to_file) {
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  // Configure general Parquet reader settings
  auto reader_properties = parquet::ReaderProperties(pool);
  reader_properties.set_buffer_size(4096 * 4);
  reader_properties.enable_buffered_stream();

  // Configure Arrow-specific Parquet reader settings
  auto arrow_reader_props = parquet::ArrowReaderProperties();
  arrow_reader_props.set_batch_size(3);  // default 64 * 1024
  arrow_reader_props.set_use_threads(true);

  parquet::arrow::FileReaderBuilder reader_builder;
  ARROW_RETURN_NOT_OK(
      reader_builder.OpenFile(path_to_file, /*memory_map=*/true, reader_properties));
  reader_builder.memory_pool(pool);
  reader_builder.properties(arrow_reader_props);

  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());

  std::shared_ptr<::arrow::RecordBatchReader> rb_reader;
  ARROW_RETURN_NOT_OK(arrow_reader->GetRecordBatchReader(&rb_reader));

  std::shared_ptr<::arrow::RecordBatch> batch;
  // Note: this loop stops silently on a non-OK status instead of propagating it.
  while (rb_reader->ReadNext(&batch).ok() && batch != nullptr) {
    std::cout << "Read:" << batch->ToString() << '\n';
  }
  // Alternative: the Result-based iterator surfaces read errors explicitly.
  // for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *rb_reader) {
  //   if (!maybe_batch.ok()) {
  //     std::cout << "Error reading batch: " << maybe_batch.status().message() << std::endl;
  //   } else {
  //     std::shared_ptr<arrow::RecordBatch> batch = maybe_batch.ValueOrDie();
  //     std::cout << "Read batch with " << batch->num_rows() << " rows" << std::endl;
  //   }
  // }
  return arrow::Status::OK();
}

arrow::Status RunExamples(std::string path_to_file) {
  // ARROW_RETURN_NOT_OK(WriteFullFile(path_to_file));
  // ARROW_RETURN_NOT_OK(ReadFullFile(path_to_file));
  // ARROW_RETURN_NOT_OK(WriteInBatches(path_to_file));
  ARROW_RETURN_NOT_OK(ReadInBatches(path_to_file));
  return arrow::Status::OK();
}
```

This doesn't crash. I'm running on my M1 macOS machine and the master branch. Would you mind providing some configs? By the way, the stack trace below is a little confusing, 🤔 why
I think the issue I hit is that I'm creating the FileReader and the RecordBatchReader in a function. It appears that the RecordBatchReader doesn't grab a reference to its parent parquet::arrow::FileReader, so you have to save that separately as well as the RecordBatchReader. So not a bug, but very hard to figure out.
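For reference, a minimal sketch of that pitfall and one way around it, keeping the FileReader alive alongside the RecordBatchReader. The `BatchReaderHolder` and `OpenBatchReader` names here are hypothetical, not Arrow API:

```cpp
#include <memory>
#include <string>

#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "parquet/arrow/reader.h"

// Hypothetical holder: the RecordBatchReader does not keep its parent
// FileReader alive, so the two are stored together and destroyed together.
struct BatchReaderHolder {
  std::unique_ptr<parquet::arrow::FileReader> file_reader;  // must outlive rb_reader
  std::shared_ptr<arrow::RecordBatchReader> rb_reader;
};

// Hypothetical factory. Returning only rb_reader from a function like this
// would be the bug pattern described above: file_reader would be destroyed
// on return, and later ReadNext calls would fail or read freed buffers.
arrow::Result<BatchReaderHolder> OpenBatchReader(const std::string& path) {
  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.OpenFile(path));

  BatchReaderHolder holder;
  ARROW_ASSIGN_OR_RAISE(holder.file_reader, builder.Build());
  ARROW_RETURN_NOT_OK(holder.file_reader->GetRecordBatchReader(&holder.rb_reader));
  return holder;
}
```

As long as the caller keeps the returned holder in scope, every batch handed out by `rb_reader` stays backed by a live FileReader.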
Describe the bug, including details regarding any error messages, version, and platform.
I have some code that is trying to iterate through the record batches of fairly large Parquet files.
The code is
and the stack trace is
I'm a bit at a loss for why this would happen.
I've also seen some references to
Invalid: Buffer #1 too small in array of type int64 and length 3: expected at least 24 byte(s), got 0
when working with extremely wide parquet files.

The Parquet file is fine -- I can read it with ReadTable, pyarrow, etc. It even works if the batch size is sufficiently large to read the file in one batch.
Any ideas as to why it would run out of buffers even if I'm only reading batches of size 3?
Component(s)
C++