Improve performance of `DictionaryArray::try_new()` #1313

alamb · 2022-02-15T20:11:45Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

#1300 adds DictionaryArray::try_new() for creating DictionaryArrays from existing keys and dictionaries. However, as noted by @jhorstmann in #1300 (comment) this results in a double validation (that the keys and values are valid arrays, when we know they already are, in addition to validating that all the keys are valid indexes into the dictionary.

Describe the solution you'd like
Minimize the checking in DictionaryArray::try_new() to only verify that all indexes are within the range of the dictionary, which is required for this to be a safe API.

It also might be worth adding a unsafe DictionaryArray::new_unchecked() which builds the arrays from data the user asserts is valid.

The text was updated successfully, but these errors were encountered:

jackwener · 2022-03-10T03:06:13Z

I can take it up. Could you assign it me? @alamb

jhorstmann · 2022-03-11T18:21:07Z

@jackwener I just opened a PR that does a small change to try_new in #1430, shouldn't cause any conflicts I think

jackwener · 2022-03-12T03:23:05Z

@jhorstmann Hello, I am newer for project, I am gradually trying to contribute.
I can't get the point double validation. Could you give me some hint? Thanks !❤

jhorstmann · 2022-03-12T08:55:38Z

ArrayDataBuilder::build calls ArrayData::try_new, which does a full validation of that struct, including all child ArrayData objects. For example building a DictionaryArray from Int32 keys to String values first validates that the child data containing these strings is valid. Since layout of a StringArray consists of an offset buffer and a data buffer, it has to check that the offsets are monotonically increasing and in bounds for the data buffer. When the child StringArray is valid then also the dictionary keys need to be validated to be in bounds of this StringArray.

In DictionaryArray::try_new, we get the child array as a parameter and can assume that is is valid without having to validate it again. We only need to check that the given keys are in bounds.

I think a possible solution would be to extract the dictionary validation logic out of ArrayData::validate_full into a separate function. DictionaryArray::try_new could then use ArrayDataBuilder::build_unchecked and afterwards call the new function which only validates that the keys are in bounds.

* improve `DictionaryArray::try_new()` #1313 * *: fix typo * *: add cheap validate and unit test * *: polish the error * Add safety note Co-authored-by: Andrew Lamb <[email protected]>

alamb added good first issue Good for newcomers arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog performance labels Feb 15, 2022

alamb mentioned this issue Feb 15, 2022

Add DictionaryArray::try_new() to create dictionaries from pre existing arrays #1300

Merged

alamb assigned jackwener Mar 10, 2022

jackwener added a commit to jackwener/arrow-rs that referenced this issue Mar 13, 2022

improve DictionaryArray::try_new() apache#1313

b9b79cf

jackwener mentioned this issue Mar 13, 2022

Improve performance of DictionaryArray::try_new() #1435

Merged

jackwener added a commit to jackwener/arrow-rs that referenced this issue Mar 22, 2022

improve DictionaryArray::try_new() apache#1313

b326af6

alamb closed this as completed in #1435 Mar 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of `DictionaryArray::try_new()` #1313

Improve performance of `DictionaryArray::try_new()` #1313

alamb commented Feb 15, 2022

jackwener commented Mar 10, 2022

jhorstmann commented Mar 11, 2022

jackwener commented Mar 12, 2022

jhorstmann commented Mar 12, 2022

Improve performance of DictionaryArray::try_new() #1313

Improve performance of DictionaryArray::try_new() #1313

Comments

alamb commented Feb 15, 2022

jackwener commented Mar 10, 2022

jhorstmann commented Mar 11, 2022

jackwener commented Mar 12, 2022

jhorstmann commented Mar 12, 2022

Improve performance of `DictionaryArray::try_new()` #1313

Improve performance of `DictionaryArray::try_new()` #1313