-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Do less column processing when pretty printing #1788
Conversation
Avoids some unneeded work when processing bathes for pretty printing.
// Avoid trying to process the entire batch. | ||
let batch = batch.slice(rows.start, rows.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if after the reslice the batch is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then ColumnValues
will be empty, and nothing will be added to the table since the for batch_idx in 0..batch.num_rows()
won't have any iterations.
for row in rows { | ||
// Note this is starting from the beginning of the sliced batch. | ||
for batch_idx in 0..batch.num_rows() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I did have to look this up, but don't Range
objects yield numbers that start at their starting point, and end at (the end), and what we've done is reslice the batch according to the bounds of the rows and then iterate over the batch indexes? Seems like this change shouldn't have an effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this change shouldn't have an effect.
Yes, this function should have no affect on the output. The real savings is avoiding passing the entire batch to ColumnValues
in the previous loop. ColumnValues
is what's doing fancy string truncation, and that's what I wanted to reduce by having the batch slicing above.
E.g. If we have 3 batches of twenty records each, and we want to output the first 10 records in the first batch, and the last 10 records in the last batch, the previous impl would do fancy string truncation on all records in all columns. With the slicing in this pr, only the records that we're actually outputting to the table are being processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so you could have say inserted slice()
into the chain of the column processing, but not changed the rows range loop, as in:
fn process_batch(
table: &mut Table,
format: &TableFormat,
batch: &RecordBatch,
rows: Range<usize>,
) -> Result<(), ArrowError> {
if rows.is_empty() {
return Ok(());
}
let mut col_vals = Vec::new();
// only process columns present in the given batch:
for (idx, col) in batch.slice(rows.start, rows.len()).columns().iter().enumerate() {
if format.is_elided[idx] {
continue;
}
let vals = ColumnValues::try_new_from_array(col, format.widths[idx])?;
col_vals.push(vals);
}
// add rows from the provided range into the table:
for row in rows {
let mut cells = Vec::with_capacity(col_vals.len());
for col in col_vals.iter_mut() {
cells.push(std::mem::take(col.vals.get_mut(row).unwrap()));
}
if format.has_ellided() {
cells.insert(elide_index(&cells), "…".to_string());
}
table.add_row(cells);
}
Ok(())
}
I don't actually feel strongly about this at all. I was just confused because the change looked like it should have been a noop (but subtly wasn't), and was thrown because the big part of the change was the part you didn't change :D :D
(also it's possible that I misunderstood and idiom)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subtlety tricky part here is the second loop, as we need to start from index zero when calling col.vals.get_mut(row)
since we're slicing the batch.
So with what you had, we'd also want to change this:
for row in rows {
to this:
for ro in 0..rows.len()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that the case if we take out the reassignment of the batch variable in my example? since the slice happens inline in the first loop (and batch isn't mutable), the second loop doesn't need to change at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would still need to change, since ColumnValues
only stores the number of records from the column that's passed in.
So if we have a batch of 20 records, and the row range is {start: 10, end: 19}
, then there's only 10 values in ColumnValues
where index 0 in that corresponds to index 10 in the original batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, indeed! I think I get it now. it's really easy to misread this. (something something regression test?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test added.
Avoids some unneeded work when processing bathes for pretty printing.