Increase default partition column type from Dict(UInt8) to Dict(UInt16) #1860
Conversation
What do you think if we make it a type parameter for the write partition dict? There is a chance we may write more than 256 distinct partition values.
That is a possibility, on a very large machine.
The idea behind using UInt8 is that the values of a given partition column within a file will all be identical. If I have to materialize a large array with only zeros, I would rather not encode each 0 on 64 bits 😄. To actually have a record batch with multiple partition values, you would need to go through something like the …
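The cost argument above can be made concrete with a back-of-the-envelope sketch. This is plain Rust with no Arrow dependency, and `encoded_sizes` is an illustrative helper, not a DataFusion API: with one distinct value per file, dictionary encoding pays one copy of the value plus a single byte per row for the u8 key, versus repeating the value on every row.

```rust
/// Approximate byte costs for a partition column whose value is identical
/// across all rows of a file (the situation described above).
/// Returns (dictionary-encoded size, plainly materialized size).
fn encoded_sizes(partition_value: &str, num_rows: usize) -> (usize, usize) {
    // Dictionary encoding: one copy of the value, plus a 1-byte u8 key per
    // row (every key is 0 because there is only one distinct value).
    let dict = partition_value.len() + num_rows * std::mem::size_of::<u8>();
    // Plain encoding: the value string repeated on every row.
    let plain = partition_value.len() * num_rows;
    (dict, plain)
}

fn main() {
    let (dict, plain) = encoded_sizes("2022-02-22", 1_000_000);
    println!("dictionary-encoded: ~{dict} bytes, plain: ~{plain} bytes");
}
```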
I think this PR proposes to use 16 bits rather than 64 to allow more than 256 distinct partition values. One example use case might be when there are more than 256 distinct postal codes in the United States.
Having some way to dynamically pick the size of the dictionary keys certainly seems like a nice feature -- I am not sure how large of a change it would be though.
I think (though I did not check) that the way the concat kernel in arrow works now is that the output type is always the same as the input type. Having the concat kernel upcast the index type (e.g. from UInt8 to UInt16) if the concatenated dictionary required it would be nice for sure.
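The upcast-on-concat idea might look roughly like the following. This is a plain-Rust sketch of the policy only, not Arrow's actual concat kernel; `narrowest_key_type` and `concat_dictionaries` are made-up names for illustration: the key type widens only when the merged dictionary actually outgrows the current index type.

```rust
use std::collections::BTreeSet;

/// Narrowest unsigned key type able to index `num_distinct` dictionary values.
fn narrowest_key_type(num_distinct: usize) -> &'static str {
    if num_distinct <= u8::MAX as usize + 1 {
        "u8" // up to 256 distinct values
    } else if num_distinct <= u16::MAX as usize + 1 {
        "u16" // up to 65_536 distinct values
    } else {
        "u32"
    }
}

/// Merge the dictionaries of two columns and report the key type the
/// concatenated result would need.
fn concat_dictionaries(a: &[&str], b: &[&str]) -> (usize, &'static str) {
    let merged: BTreeSet<&str> = a.iter().chain(b).copied().collect();
    (merged.len(), narrowest_key_type(merged.len()))
}

fn main() {
    // Two files, each with a single distinct partition value: u8 still suffices.
    println!("{:?}", concat_dictionaries(&["2022-01-01"], &["2022-01-02"]));
}
```

Under this policy a file-level batch keeps its narrow u8 keys, and widening happens only at the point where batches with different partition values are actually combined.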
Just to be clear, I came across this with date partitions :) a year of dates = 365 partition values |
The simple way to test this is to have a test with more than 256 partition values in listing::helpers |
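A minimal sketch of the failure mode such a test would exercise (illustrative plain Rust, not the actual listing::helpers test; `assign_u8_keys` is a hypothetical helper): assigning u8 dictionary keys to a year of date partitions fails once the partition index passes 255.

```rust
/// Try to assign a u8 dictionary key to each of `num_partitions` distinct
/// partition values; fails as soon as an index no longer fits in u8.
fn assign_u8_keys(num_partitions: usize) -> Result<Vec<u8>, String> {
    (0..num_partitions)
        .map(|i| u8::try_from(i).map_err(|_| format!("partition index {i} does not fit in u8")))
        .collect()
}

fn main() {
    // 256 partitions (indices 0..=255) are fine; 365 date partitions are not.
    println!("256 partitions ok: {}", assign_u8_keys(256).is_ok());
    println!("365 partitions ok: {}", assign_u8_keys(365).is_ok());
}
```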
I am not challenging that you can have partition keys with billions of different values 🙂. But I think that this isn't the best place to bump the dictionary index size, as it is correct to say that at the file level, you cannot have more than one distinct value in a partition column for one record batch. It would be nicer to upcast this type downstream, when the record batches are manipulated in a way that implies that this uniqueness doesn't hold anymore (like after a …).

If we find that it is too complex to do it downstream, I am not firmly opposed to upcasting the type here, but then I agree with @yjshen that u16 isn't really enough. Also, making it customizable introduces some tuning complexity that isn't really ideal either.
@rdettai what would you think about merging this PR as a temporary workaround for common cases (like days of the year) and filing a ticket (I am happy to do so) to track the more optimal behavior? |
I mean I can change it to u32 if that floats your boat. |
In order to unstick this PR, I plan to file a follow-on ticket to add more sophisticated handling of dictionaries, and then merge this PR as a workaround until that is done.
Which issue does this PR close?
Closes #1859
Rationale for this change
I have large numbers of partition values
What changes are included in this PR?
The default data type of dictionary keys for partition columns is now Dict(UInt16) instead of Dict(UInt8).
Are there any user-facing changes?
Yes: code reading partition values from the RecordBatch must now use UInt16 dictionary keys instead of UInt8.