Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase default partition column type from Dict(UInt8) to Dict(UInt16) #1860

Merged
merged 2 commits into from
Mar 5, 2022

Conversation

Igosuki
Copy link
Contributor

@Igosuki Igosuki commented Feb 18, 2022

Which issue does this PR close?

Closes #1859

Rationale for this change

I have large numbers of partition values

What changes are included in this PR?

Default data type of dictionary values for partitions is now Dictionary

Are there any user-facing changes?

Getting partition values from the RecordBatch now changes to using Uint16

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Feb 18, 2022
@yjshen
Copy link
Member

yjshen commented Feb 18, 2022

What do you think if we make it a type parameter for the write partition dict? There are chances we may write to more than u16::Max partitions as well?

@Igosuki
Copy link
Contributor Author

Igosuki commented Feb 18, 2022

That is a possibility, on a very large machine

@alamb alamb changed the title Fix dict key size Increase default partition column type from Dict(UInt8) to Dict(UInt16) Mar 1, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me -- thank you @Igosuki

Some thoughts:

  1. Is there some way to test this code (mostly to ensure that something would fail if the keys changed back to UInt8?)
  2. I wonder if there is any value to making the key size configurable?

cc @rdettai

@rdettai
Copy link
Contributor

rdettai commented Mar 2, 2022

The idea behind using UInt8 is that the values of a given partition column within a file will be all identical. If I have to materialize a large array with only zeros, I would rather not encode each 0 on 64 bits 😄. To actually have a record batch with multiple partition values, you would need to go through something like the concat kernel first. Wouldn't it make sense to rely on that kernel to re-cast the index type appropriately? I think that it would be a safer approach in general to avoid overflowing when merging dictionaries.

@alamb
Copy link
Contributor

alamb commented Mar 2, 2022

The idea behind using UInt8 is that the values of a given partition column within a file will be all identical. If I have to materialize a large array with only zeros, I would rather not encode each 0 on 64 bits 😄.

I think this PR proposes to use 16 bits rather than 64 to allow more than 256 distinct partition values. One example usecase might be when there are more than 256 distinct postal codes in the United States)

To actually have a record batch with multiple partition values, you would need to go through something like the concat kernel first. Wouldn't it make sense to rely on that kernel to re-cast the index type appropriately? I think that it would be a safer approach in general to avoid overflowing when merging dictionaries.

Having some way to dynamically pick the size of the dictionary keys certainly seems like a nice feature -- I am not sure how large of a change it would be though.

@alamb
Copy link
Contributor

alamb commented Mar 2, 2022

I think (though I did not check) that the way the concat kernel in arrow works now is that the output type is always the same as the input type. Having the concat kernel upcast the index type (e.g. from UInt8 to UInt16) if the concatenated dictionary required it would be nice for sure

@Igosuki
Copy link
Contributor Author

Igosuki commented Mar 2, 2022

Just to be clear, I came across this with date partitions :) a year of dates = 365 partition values

@Igosuki
Copy link
Contributor Author

Igosuki commented Mar 2, 2022

The simple way to test this is to have a test with more than 256 partition values in listing::helpers

@rdettai
Copy link
Contributor

rdettai commented Mar 3, 2022

I think this PR proposes to use 16 bits rather than 64 to allow more than 256 distinct partition values. One example usecase might be when there are more than 256 distinct postal codes in the United States)

I am not challenging that you can have partitions keys with billions of different values 🙂. But I think that this isn't the best place to bump the dictionary index size as it is correct to say that at the file level, you cannot have more than one different value in a partition column for one record batch. It would be nicer to upcast this type downstream, when the record batches are manipulated in a way that implies that this uniqueness doesn't hold anymore (like after a concat op). Also, it would be even nicer if we had #1248 instead 😄

If we find that it is too complex to do it downstream, I am not firmly opposed to upcast the type here, but then I agree with @yjshen that u16 isn't really enough. Also, making it customizable introduces some tuning complexity that isn't really ideal either.

@alamb
Copy link
Contributor

alamb commented Mar 3, 2022

@rdettai what would you think about merging this PR as a temporary workaround for common cases (like days of the year) and filing a ticket (I am happy to do so) to track the more optimal behavior?

@Igosuki
Copy link
Contributor Author

Igosuki commented Mar 3, 2022

I mean I can change it to u32 if that floats your boat.

@alamb
Copy link
Contributor

alamb commented Mar 5, 2022

In order to unstick this PR I plan to file a follow on ticket to add a more sophisticated handling of dictionaries and then merge this PR in as a workaround until it is done

@alamb
Copy link
Contributor

alamb commented Mar 5, 2022

Filed #1931 for follow on work ; Thanks again @Igosuki and @rdettai

@alamb alamb merged commit 6cc9916 into apache:master Mar 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UInt8 isn't enough for partitioning values
4 participants