Increase default partition column type from Dict(UInt8) to Dict(UInt16) #1860

Igosuki · 2022-02-18T01:51:29Z

Which issue does this PR close?

Closes #1859

Rationale for this change

I have large numbers of partition values

What changes are included in this PR?

The default data type for partition columns is now a Dictionary with UInt16 keys (previously UInt8 keys)

Are there any user-facing changes?

Code that reads partition values from the RecordBatch must now expect dictionary keys of type UInt16 instead of UInt8

yjshen · 2022-02-18T03:54:22Z

What do you think about making it a type parameter for the write partition dict? There is a chance we may write to more than u16::MAX partitions as well.
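To illustrate the type-parameter idea, here is a minimal standalone sketch of a dictionary builder that is generic over its key type. `PartitionDictBuilder` and its methods are hypothetical names invented for this example; they are not part of DataFusion or arrow, and the real code path may look quite different.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical dictionary builder, generic over the key type `K`.
/// Illustration only: this is not DataFusion's or arrow's API.
pub struct PartitionDictBuilder<K> {
    /// One key per row, indexing into `values`.
    keys: Vec<K>,
    /// Distinct partition values, in first-seen order.
    values: Vec<String>,
    /// Reverse lookup from a value to its position in `values`.
    index: HashMap<String, usize>,
}

impl<K: TryFrom<usize> + Copy> PartitionDictBuilder<K> {
    pub fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new(), index: HashMap::new() }
    }

    /// Appends one row. Returns `Err` when the value's dictionary index
    /// does not fit in `K` (for example, the 257th distinct value with `u8`).
    pub fn append(&mut self, value: &str) -> Result<(), String> {
        let pos = match self.index.get(value).copied() {
            Some(pos) => pos,
            None => self.values.len(),
        };
        // Convert before mutating, so a failed append leaves the builder unchanged.
        let key = K::try_from(pos)
            .map_err(|_| format!("dictionary index {} overflows the key type", pos))?;
        if pos == self.values.len() {
            self.values.push(value.to_string());
            self.index.insert(value.to_string(), pos);
        }
        self.keys.push(key);
        Ok(())
    }

    /// Number of distinct values stored in the dictionary.
    pub fn distinct_values(&self) -> usize {
        self.values.len()
    }

    /// Number of rows appended so far.
    pub fn rows(&self) -> usize {
        self.keys.len()
    }
}
```

With `PartitionDictBuilder::<u8>`, the 257th distinct value is rejected, while `PartitionDictBuilder::<u16>` accepts up to 65,536 distinct values. The caller chooses the trade-off between key width and capacity.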

Igosuki · 2022-02-18T11:03:08Z

That is a possibility, on a very large machine

alamb

Looks reasonable to me -- thank you @Igosuki

Some thoughts:

Is there some way to test this code (mostly to ensure that something would fail if the keys changed back to UInt8?)
I wonder if there is any value to making the key size configurable?

cc @rdettai

rdettai · 2022-03-02T10:28:58Z

The idea behind using UInt8 is that the values of a given partition column within a file will be all identical. If I have to materialize a large array with only zeros, I would rather not encode each 0 on 64 bits 😄. To actually have a record batch with multiple partition values, you would need to go through something like the concat kernel first. Wouldn't it make sense to rely on that kernel to re-cast the index type appropriately? I think that it would be a safer approach in general to avoid overflowing when merging dictionaries.
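The memory argument above can be made concrete with a small sketch. It assumes the key buffer holds exactly one key per row and ignores the dictionary values and any validity bitmap; `key_buffer_bytes` is a hypothetical helper written for this example.

```rust
use std::mem::size_of;

/// Bytes needed for the key buffer of a dictionary column with `rows` rows
/// and key type `K`. Assumption: one key per row, with no validity bitmap
/// and no dictionary values counted.
pub fn key_buffer_bytes<K>(rows: usize) -> usize {
    rows * size_of::<K>()
}
```

For a file with one million rows and a single partition value, `key_buffer_bytes::<u8>(1_000_000)` is 1,000,000 bytes, `u16` keys take 2,000,000 bytes, and `u64` keys take 8,000,000 bytes, even though every key is zero.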

alamb · 2022-03-02T16:33:06Z

> The idea behind using UInt8 is that the values of a given partition column within a file will be all identical. If I have to materialize a large array with only zeros, I would rather not encode each 0 on 64 bits 😄.

I think this PR proposes to use 16 bits rather than 64 to allow more than 256 distinct partition values. One example use case is partitioning by postal code (there are more than 256 distinct postal codes in the United States).

> To actually have a record batch with multiple partition values, you would need to go through something like the concat kernel first. Wouldn't it make sense to rely on that kernel to re-cast the index type appropriately? I think that it would be a safer approach in general to avoid overflowing when merging dictionaries.

Having some way to dynamically pick the size of the dictionary keys certainly seems like a nice feature -- I am not sure how large of a change it would be though.
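As a rough sketch of what picking the key size dynamically could mean, the function below chooses the narrowest unsigned key width that can index a given number of distinct values. `KeyWidth` and `smallest_key_width` are hypothetical names for this example, not existing DataFusion or arrow items.

```rust
/// Candidate unsigned key widths for a dictionary column.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum KeyWidth {
    U8,
    U16,
    U32,
    U64,
}

/// Returns the narrowest key width whose range covers the indices
/// `0..distinct_values`. An empty dictionary gets the narrowest width.
pub fn smallest_key_width(distinct_values: u64) -> KeyWidth {
    // The largest index that must be representable.
    let max_key = distinct_values.saturating_sub(1);
    if max_key <= u8::MAX as u64 {
        KeyWidth::U8
    } else if max_key <= u16::MAX as u64 {
        KeyWidth::U16
    } else if max_key <= u32::MAX as u64 {
        KeyWidth::U32
    } else {
        KeyWidth::U64
    }
}
```

Under this rule, 256 distinct values still fit in `U8`, while 365 values (one per day of a year) need `U16`.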

alamb · 2022-03-02T16:35:03Z

I think (though I did not check) that the way the concat kernel in arrow works now is that the output type is always the same as the input type. Having the concat kernel upcast the index type (e.g. from UInt8 to UInt16) if the concatenated dictionary required it would be nice for sure
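The upcasting behaviour described here could look roughly like the sketch below. It uses plain vectors as a stand-in for dictionary arrays and is not arrow's concat kernel: `DictU8`, `Concatenated`, and `concat_with_upcast` are hypothetical names, and the sketch assumes every input key is a valid index into its own `values`.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a dictionary-encoded string column with `u8` keys.
#[derive(Debug, Clone, PartialEq)]
pub struct DictU8 {
    pub keys: Vec<u8>,
    pub values: Vec<String>,
}

/// Result of a concatenation: keys stay `u8` when they fit, else widen to `u16`.
#[derive(Debug, PartialEq)]
pub enum Concatenated {
    U8 { keys: Vec<u8>, values: Vec<String> },
    U16 { keys: Vec<u16>, values: Vec<String> },
}

/// Concatenates the inputs, merging their dictionaries and deduplicating values.
/// Returns `None` if the merged dictionary needs more than 16-bit keys.
pub fn concat_with_upcast(inputs: &[DictU8]) -> Option<Concatenated> {
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<String, usize> = HashMap::new();
    let mut merged: Vec<usize> = Vec::new();

    for input in inputs {
        for &key in &input.keys {
            // Assumption: `key` is a valid index into `input.values`.
            let value = &input.values[key as usize];
            let pos = match index.get(value).copied() {
                Some(pos) => pos,
                None => {
                    let pos = values.len();
                    values.push(value.clone());
                    index.insert(value.clone(), pos);
                    pos
                }
            };
            merged.push(pos);
        }
    }

    // Pick the narrowest key type that can index the merged dictionary.
    if values.len() <= 256 {
        let keys = merged.iter().map(|&k| k as u8).collect();
        Some(Concatenated::U8 { keys, values })
    } else if values.len() <= 65_536 {
        let keys = merged.iter().map(|&k| k as u16).collect();
        Some(Concatenated::U16 { keys, values })
    } else {
        None
    }
}
```

Concatenating 300 single-value inputs, as would happen when reading 300 partition directories, yields the `U16` variant, while 100 such inputs stay in the `U8` variant.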

Igosuki · 2022-03-02T19:16:39Z

Just to be clear, I came across this with date partitions :) a year of dates = 365 partition values

Igosuki · 2022-03-02T19:19:08Z

The simple way to test this is to have a test with more than 256 partition values in listing::helpers
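A standalone sketch of the kind of fixture such a test could use is shown below. It does not use DataFusion's test utilities or anything from listing::helpers; `date_partitions_2021` and `keys_fit` are hypothetical helpers that only demonstrate why 365 date partitions overflow 8-bit keys.

```rust
use std::convert::TryFrom;

/// Returns one partition value per day of 2021 (not a leap year),
/// formatted as "YYYY-MM-DD".
pub fn date_partitions_2021() -> Vec<String> {
    const DAYS_IN_MONTH: [u32; 12] = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    let mut out = Vec::new();
    for (month, &days) in DAYS_IN_MONTH.iter().enumerate() {
        for day in 1..=days {
            out.push(format!("2021-{:02}-{:02}", month + 1, day));
        }
    }
    out
}

/// True when every dictionary index for `distinct` values fits in key type `K`.
pub fn keys_fit<K: TryFrom<usize>>(distinct: usize) -> bool {
    distinct == 0 || K::try_from(distinct - 1).is_ok()
}
```

A regression test built on this idea would assert that `keys_fit::<u8>(365)` is false and `keys_fit::<u16>(365)` is true, so it would fail if the key type were changed back to UInt8.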

rdettai · 2022-03-03T14:42:08Z

> I think this PR proposes to use 16 bits rather than 64 to allow more than 256 distinct partition values. One example use case is partitioning by postal code (there are more than 256 distinct postal codes in the United States).

I am not disputing that you can have partition keys with billions of different values 🙂. But I think that this isn't the best place to bump the dictionary index size: at the file level, a partition column cannot hold more than one distinct value within a single record batch. It would be nicer to upcast this type downstream, when the record batches are manipulated in a way that means this uniqueness no longer holds (like after a concat op). Also, it would be even nicer if we had #1248 instead 😄

If we find that it is too complex to do it downstream, I am not firmly opposed to upcasting the type here, but then I agree with @yjshen that u16 isn't really enough. Also, making it customizable introduces some tuning complexity that isn't really ideal either.

alamb · 2022-03-03T15:03:03Z

@rdettai what would you think about merging this PR as a temporary workaround for common cases (like days of the year) and filing a ticket (I am happy to do so) to track the more optimal behavior?

Igosuki · 2022-03-03T18:10:59Z

I mean I can change it to u32 if that floats your boat.

alamb · 2022-03-05T21:36:58Z

In order to unstick this PR, I plan to file a follow-on ticket for more sophisticated handling of dictionaries, and then merge this PR as a workaround until that work is done.

alamb · 2022-03-05T22:01:08Z

Filed #1931 for the follow-on work. Thanks again @Igosuki and @rdettai

Igosuki added 2 commits February 18, 2022 01:40

Increase partition column data type dictionnary key size to 16 bits

236cd8b

Double buffer size for partitioning dict keys

df22cab

github-actions bot added the datafusion label (Changes in the datafusion crate) Feb 18, 2022

alamb changed the title ~~Fix dict key size~~ Increase default partition column type from Dict(UInt8) to Dict(UInt16) Mar 1, 2022

alamb approved these changes Mar 1, 2022


alamb mentioned this pull request Mar 5, 2022

More efficient Dictionary / constant encoding for partition values in ListingFileProvider #1931

Open

alamb merged commit 6cc9916 into apache:master Mar 5, 2022
