
chore: Do less column processing when pretty printing #1788

Merged

scsmithr merged 5 commits into main from sean/less-process on Sep 20, 2023

Conversation

scsmithr (Member)

Avoids some unneeded work when processing batches for pretty printing.
Comment on lines +526 to +527
// Avoid trying to process the entire batch.
let batch = batch.slice(rows.start, rows.len());
Contributor

What if, after the reslice, the batch is empty?

Member Author

Then `ColumnValues` will be empty, and nothing will be added to the table, since `for batch_idx in 0..batch.num_rows()` won't have any iterations.
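
To make that concrete, here's a minimal standalone sketch (assuming the arrow crate's `RecordBatch` API; the schema and values are made up for illustration) showing that an empty row range slices down to a zero-row batch, so the per-row loop body never runs:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let col: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_new(schema, vec![col])?;

    // An empty row range produces a zero-row slice.
    let rows = 1..1;
    let sliced = batch.slice(rows.start, rows.len());
    assert_eq!(sliced.num_rows(), 0);

    // Same shape as the loop in process_batch: zero iterations, nothing added.
    for _batch_idx in 0..sliced.num_rows() {
        unreachable!("empty slice means this body never runs");
    }

    Ok(())
}
```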

Comment on lines -531 to +539
for row in rows {
// Note this is starting from the beginning of the sliced batch.
for batch_idx in 0..batch.num_rows() {
Contributor

OK, so I did have to look this up, but don't Range objects yield numbers that run from their start to their end? And what we've done is reslice the batch according to the bounds of the rows and then iterate over the batch indexes? Seems like this change shouldn't have an effect.

Member Author

> Seems like this change shouldn't have an effect.

Yes, this function should have no effect on the output. The real savings is avoiding passing the entire batch to ColumnValues in the previous loop. ColumnValues is what's doing the fancy string truncation, and that's the work I wanted to reduce by slicing the batch above.

E.g. if we have 3 batches of 20 records each, and we want to output the first 10 records of the first batch and the last 10 records of the last batch, the previous impl would do fancy string truncation on all records in all columns. With the slicing in this PR, only the records that we're actually outputting to the table get processed.
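
A rough illustration of that cost difference (the helper names below are purely hypothetical stand-ins, not the actual ColumnValues implementation): the truncation work is proportional to the values passed in, so slicing first means only the rows headed to the table get formatted.

```rust
// Stand-in for the per-value "fancy string truncation".
fn truncate_fancy(val: &str, width: usize) -> String {
    val.chars().take(width).collect()
}

// Stand-in for building column values: work scales with the rows passed in.
fn format_column(values: &[String], width: usize) -> Vec<String> {
    values.iter().map(|v| truncate_fancy(v, width)).collect()
}

fn main() {
    let column: Vec<String> = (0..20).map(|i| format!("row-{i:02}-some-long-value")).collect();

    // Previous approach: every value gets truncated, even ones never printed.
    let all = format_column(&column, 8);
    assert_eq!(all.len(), 20);

    // With slicing: only the output range (the last 10 rows here) is processed.
    let rows = 10..20;
    let needed = format_column(&column[rows], 8);
    assert_eq!(needed.len(), 10);
}
```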

Contributor

Ah, so you could have, say, inserted slice() into the chain of the column processing, but not changed the rows range loop, as in:

```rust
fn process_batch(
    table: &mut Table,
    format: &TableFormat,
    batch: &RecordBatch,
    rows: Range<usize>,
) -> Result<(), ArrowError> {
    if rows.is_empty() {
        return Ok(());
    }

    let mut col_vals = Vec::new();

    // only process columns present in the given batch:
    for (idx, col) in batch.slice(rows.start, rows.len()).columns().iter().enumerate() {
        if format.is_elided[idx] {
            continue;
        }
        let vals = ColumnValues::try_new_from_array(col, format.widths[idx])?;
        col_vals.push(vals);
    }

    // add rows from the provided range into the table:
    for row in rows {
        let mut cells = Vec::with_capacity(col_vals.len());
        for col in col_vals.iter_mut() {
            cells.push(std::mem::take(col.vals.get_mut(row).unwrap()));
        }

        if format.has_ellided() {
            cells.insert(elide_index(&cells), "…".to_string());
        }

        table.add_row(cells);
    }

    Ok(())
}
```

I don't actually feel strongly about this at all. I was just confused because the change looked like it should have been a noop (but subtly wasn't), and was thrown off because the big part of the change was the part you didn't change :D :D

(also it's possible that I misunderstood an idiom)

Member Author

The subtly tricky part here is the second loop: we need to start from index zero when calling col.vals.get_mut(row), since we're slicing the batch.

So with what you had, we'd also want to change this:

for row in rows {

to this:

for row in 0..rows.len() {

Contributor

Is that the case if we take out the reassignment of the batch variable in my example? Since the slice happens inline in the first loop (and batch isn't mutable), the second loop doesn't need to change at all.

Member Author

It would still need to change, since ColumnValues only stores values for the column slice that's passed in.

So if we have a batch of 20 records and the row range is {start: 10, end: 20}, then there are only 10 values in ColumnValues, where index 0 corresponds to index 10 in the original batch.
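
A tiny self-contained sketch of that index mapping (plain Vecs standing in for the batch and ColumnValues): position i in the sliced values corresponds to rows.start + i in the original batch.

```rust
fn main() {
    // 20 "records" in the original batch.
    let original: Vec<i32> = (0..20).collect();

    // Output range: the last 10 records.
    let rows = 10..20;

    // Values built from the sliced batch are re-indexed from 0.
    let sliced: Vec<i32> = original[rows.clone()].to_vec();
    assert_eq!(sliced.len(), 10);

    // Index 0 in the sliced values is index rows.start (10) in the original.
    for row in 0..rows.len() {
        assert_eq!(sliced[row], original[rows.start + row]);
    }
}
```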

Contributor

Oh, indeed! I think I get it now. It's really easy to misread this. (Something something regression test?)

Member Author

Test added.

@scsmithr scsmithr requested a review from tychoish September 19, 2023 23:19
@scsmithr scsmithr enabled auto-merge (squash) September 20, 2023 02:06
@scsmithr scsmithr merged commit 6a70827 into main Sep 20, 2023
@scsmithr scsmithr deleted the sean/less-process branch September 20, 2023 02:15