Provide "optimal" encoding #4218
Do you have a method to decide which encoding is the "optimal" one? It might require both sampling the data and some heuristic or other approach. Do you have any idea or formula here?
This isn't quite correct; for a V1 writer the default encoding is RLE_DICTIONARY, falling back to PLAIN on exceeding the dictionary page size. There are no other non-deprecated encodings supported by the V1 spec. For a V2 writer, the defaults are similar, but falling back to DELTA_BYTE_ARRAY for byte array types instead of PLAIN. Perhaps you could give an example where the encoding is not as you would expect?
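For reference, a minimal sketch of how the writer version (and hence the fallback behaviour described above) is selected with the parquet crate's `WriterProperties` builder; method names are from recent crate versions, so check the docs for the version you use:

```rust
use std::fs::File;

use arrow_array::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{WriterProperties, WriterVersion};

fn write_batch(file: File, batch: &RecordBatch) -> parquet::errors::Result<()> {
    // PARQUET_1_0 (the default): RLE_DICTIONARY falling back to PLAIN.
    // PARQUET_2_0: similar, but byte array columns fall back to a delta encoding.
    let props = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(batch)?;
    writer.close()?;
    Ok(())
}
```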
https://github.com/apache/arrow-rs/blob/master/parquet/src/basic.rs#L222-L230 — sorry, it's my mistake. I saw this line before and thought everything was PLAIN encoded.
If our data is
RLE Hybrid is used to encode level data and dictionary indices. The default settings will therefore PLAIN encode the values themselves. For v2 writers there is a form of delta encoding; however, amusingly, the linked paper says precisely not to do what the parquet specification then goes on to do 😆. This translates into pretty terrible decode performance, and I would not recommend using it for most workloads.
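To illustrate why dictionary indices suit run-length encoding, here is a toy run-length encoder — not the crate's actual hybrid RLE/bit-packing implementation, which also bit-packs short runs:

```rust
// Toy sketch: collapse a slice of dictionary indices into (value, run_length)
// pairs. Repeated values (the case where dictionaries win) compress to a
// handful of runs.
fn run_length_encode(indices: &[u32]) -> Vec<(u32, usize)> {
    let mut runs: Vec<(u32, usize)> = Vec::new();
    for &idx in indices {
        match runs.last_mut() {
            Some((value, count)) if *value == idx => *count += 1,
            _ => runs.push((idx, 1)),
        }
    }
    runs
}

fn main() {
    // ["a", "a", "a", "b"] dictionary-encodes to dictionary ["a", "b"]
    // plus indices [0, 0, 0, 1]; the indices collapse into two runs.
    assert_eq!(run_length_encode(&[0, 0, 0, 1]), vec![(0, 3), (1, 1)]);
}
```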
So in the case of v2 writers, the default encoding chosen is delta instead of PLAIN. Has this been chosen internally? Sorry, my example may not be good.
I am very interested in this paper. Can you tell me the title of the paper? I'll go study :D
For v2 the dictionary fallback is https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6 for byte arrays, and PLAIN for everything else. https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5 is never used by default. Ultimately DICTIONARY falling back to PLAIN is very fast and well supported, and the space efficiency is good enough for most workloads; alternatives have a hard task driving broad ecosystem adoption. You can always do better than parquet, but people use it because it is good enough and well supported.
So in summary, if we want to write Arrow's in-memory data to a parquet file, we generally do not need to specify an encoding. Will parquet automatically choose a suitable encoding for us?
Correct, the defaults should be appropriate for most workloads. Some workloads may benefit from tweaking based on empirical data, e.g. smaller row groups, etc., but I would advise against premature optimisation here.
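As a sketch of that kind of tweak, assuming the parquet crate's builder API (the 64k figure is an illustrative choice, not a recommendation):

```rust
use parquet::file::properties::WriterProperties;

// Sketch: cap row groups at 64k rows instead of the (much larger) default,
// trading extra metadata overhead for finer-grained pruning. Tune empirically.
fn small_row_group_props() -> WriterProperties {
    WriterProperties::builder()
        .set_max_row_group_size(64 * 1024)
        .build()
}
```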
Hi @tustvold. Previously we used arrow2's PLAIN encoding, but it has now been changed to arrow-rs's default encoding. We can observe that the written buffers have changed, but all of the changed buffers have become larger. Is this expected?
Yes, it's a heuristic; there is no guaranteed way to know ahead of time the most efficient way to encode a given block of data. Consider the case of no repeated values: dictionary encoding will be larger. The writer will fall back to PLAIN encoding once the dictionary page is full (1 MB), but for very small columns with low repetition, it is highly probable the encoding will be larger.
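That threshold is configurable; a sketch of forcing the fallback to trigger earlier by shrinking the dictionary page size limit (builder method from the parquet crate, 64 KiB chosen for illustration):

```rust
use parquet::file::properties::WriterProperties;

// Sketch: with a 64 KiB dictionary page size limit, the writer abandons
// dictionary encoding and falls back to PLAIN well before the 1 MB default.
fn early_fallback_props() -> WriterProperties {
    WriterProperties::builder()
        .set_dictionary_page_size_limit(64 * 1024) // bytes
        .build()
}
```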
May I ask where the logic for this is located in the code?
If I want to choose an encoding in the upper-layer application based on data characteristics, such as DELTA, are there any previous studies that can be used for reference?
I'm not aware of any, but I would be interested should you find such information; we just follow the example of the other parquet writers like parquet-mr. I suspect that if you have a cardinality estimate of the input, you can make a fairly good guess as to whether dictionary encoding is valuable. If your application is really sensitive to storage size, you could consider lowering the max dictionary page size so that fallback triggers earlier, or possibly explore the block compression options. Alternatively, if you wanted to contribute a PR that would optionally re-encode on fallback, instead of preserving what has already been dictionary encoded, I would be willing to review it.
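A hypothetical sketch of such an upper-layer heuristic — the `distinct_ratio` input and the 10% cutoff are illustrative assumptions, not anything arrow-rs provides; the per-column builder methods are from the parquet crate:

```rust
use parquet::basic::Encoding;
use parquet::file::properties::{WriterProperties, WriterPropertiesBuilder};
use parquet::schema::types::ColumnPath;

// Hypothetical: pick per-column settings from an externally estimated
// distinct-value ratio (distinct values / total values).
fn tune_column(
    builder: WriterPropertiesBuilder,
    column: &str,
    distinct_ratio: f64,
) -> WriterPropertiesBuilder {
    let path = ColumnPath::from(column);
    if distinct_ratio < 0.1 {
        // Low cardinality: dictionary encoding should pay off.
        builder.set_column_dictionary_enabled(path, true)
    } else {
        // High cardinality: skip the dictionary to avoid paying for a
        // dictionary page that will mostly be abandoned on fallback.
        builder
            .set_column_dictionary_enabled(path.clone(), false)
            .set_column_encoding(path, Encoding::PLAIN)
    }
}

fn main() {
    let props = tune_column(WriterProperties::builder(), "user_id", 0.8).build();
    let _ = props;
}
```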
Thank you. If I find some useful information, I will share it.
Parquet supports many types of encoding. We could provide "optimal" encoding, e.g. by default the most suitable encoding would be selected based on the characteristics of the data, rather than letting users choose. Currently, the default encoding is plain, which is not a good approach: requiring users to choose an encoding based on data characteristics places relatively high demands on them.
Originally posted by @tustvold in issues: Non-Goals