Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of DictionaryArray::try_new() #1313

Closed
alamb opened this issue Feb 15, 2022 · 4 comments · Fixed by #1435
Closed

Improve performance of DictionaryArray::try_new() #1313

alamb opened this issue Feb 15, 2022 · 4 comments · Fixed by #1435
Assignees
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers performance

Comments

@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

#1300 adds DictionaryArray::try_new() for creating DictionaryArrays from existing keys and dictionaries. However, as noted by @jhorstmann in #1300 (comment) this results in a double validation (that the keys and values are valid arrays, when we know they already are, in addition to validating that all the keys are valid indexes into the dictionary.

Describe the solution you'd like
Minimize the checking in DictionaryArray::try_new() to only verify that all indexes are within the range of the dictionary, which is required for this to be a safe API.

It also might be worth adding a unsafe DictionaryArray::new_unchecked() which builds the arrays from data the user asserts is valid.

@alamb alamb added good first issue Good for newcomers arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog performance labels Feb 15, 2022
@jackwener
Copy link
Member

I can take it up. Could you assign it me? @alamb

@jhorstmann
Copy link
Contributor

@jackwener I just opened a PR that does a small change to try_new in #1430, shouldn't cause any conflicts I think

@jackwener
Copy link
Member

@jhorstmann Hello, I am newer for project, I am gradually trying to contribute.
I can't get the point double validation. Could you give me some hint? Thanks !❤

@jhorstmann
Copy link
Contributor

ArrayDataBuilder::build calls ArrayData::try_new, which does a full validation of that struct, including all child ArrayData objects. For example building a DictionaryArray from Int32 keys to String values first validates that the child data containing these strings is valid. Since layout of a StringArray consists of an offset buffer and a data buffer, it has to check that the offsets are monotonically increasing and in bounds for the data buffer. When the child StringArray is valid then also the dictionary keys need to be validated to be in bounds of this StringArray.

In DictionaryArray::try_new, we get the child array as a parameter and can assume that is is valid without having to validate it again. We only need to check that the given keys are in bounds.

I think a possible solution would be to extract the dictionary validation logic out of ArrayData::validate_full into a separate function. DictionaryArray::try_new could then use ArrayDataBuilder::build_unchecked and afterwards call the new function which only validates that the keys are in bounds.

jackwener added a commit to jackwener/arrow-rs that referenced this issue Mar 13, 2022
jackwener added a commit to jackwener/arrow-rs that referenced this issue Mar 22, 2022
alamb added a commit that referenced this issue Mar 22, 2022
* improve `DictionaryArray::try_new()` #1313

* *: fix typo

* *: add cheap validate and unit test

* *: polish the error

* Add safety note

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants