Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ASCII fast path for some String scalar functions #12306

Open
3 of 6 tasks
2010YOUY01 opened this issue Sep 3, 2024 · 2 comments
Open
3 of 6 tasks

ASCII fast path for some String scalar functions #12306

2010YOUY01 opened this issue Sep 3, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@2010YOUY01
Copy link
Contributor

2010YOUY01 commented Sep 3, 2024

Is your feature request related to a problem or challenge?

String operations on UTF8 encoding are relatively more expensive, due to UTF8 being variable length encoding, and each character can be encoded with 1~4 bytes

For example, a UTF8 string "Hello🌏世界" in-memory representation is (x for 1 byte)

[x][x][x][x][x][xxxx][xxx][xxx]

Some seemingly cheap operation liks substr(utf8_col, i, j), character_length(utf8_col) will actually decode the whole string, instead of doing some O(1) operation. If we can assume one string column batch is ASCII only, then those operations are indeed cheap.

However:

  • Many data are ASCII encoded (1 Byte encoding subset of UTF8), which includes the most common English characters, numbers, etc.
  • Validating if a string array is ASCII-encoded is fast
    • Validation implementation is compiler/CPU friendly, can run ~memory bandwidth
    • It's possible to check in batch, for each string array

So it's possible to first do the check within those functions. If the string array is ASCII-only, then run the specialized path. The ASCII validation overhead should be worth the performance gain in the general cases.

This should be a common trick which has been implemented in Velox and Photon, as their paper has mentioned.
Below is the numbers from Velox
image

I did a quick experiment on character_length()/ substr() scalar functions, and got some speedup for ASCII cases, the validation overhead is very little.
substr() can get another 80% faster upon #12044, for some microbenches with string length 128B

Update

It has been done on char_length() string function and got some performance improvement #12356

The remaining tasks:

Describe the solution you'd like

For scalar functions applicable to ASCII specialization, within function implementation, first validate whether String array is ASCII only, if so enable the fast path.
Functions possible to speed up: character_length(), substr(), lower(), upper()
(And maybe some more like regex functions, need some further investigation)

Describe alternatives you've considered

Add an option to let users to specify whether a column is fully ASCII
Since the always-validate approach is easier to use, and not so expensive, we can leave this to the future

Additional context

No response

@2010YOUY01 2010YOUY01 added the enhancement New feature or request label Sep 3, 2024
@2010YOUY01
Copy link
Contributor Author

I plan to first do a PR with character_length() function

@alamb
Copy link
Contributor

alamb commented Sep 6, 2024

#12356 is quite cool 🚀

Add an option to let users to specify whether a column is fully ASCII
Since the always-validate approach is easier to use, and not so expensive, we can leave this to the future

I strongly suspect that this is what velox / photon likely do (aka mark which batches are ascii only on creation and the propagate that flag through execution).

The current implementation of arrow is_ascii examines all the bytes in the array:
https://docs.rs/arrow-array/53.0.0/src/arrow_array/array/byte_array.rs.html#262

One thing we might consider doing to improve performance is to add a is_ascii flag to StringArray and StringViewArray if the data is entirely ascii and then propagate that flag through common kernels (e.g. take / filter).

For example, once is_ascii is called once on a SrtingArray and returns true, there is no reason ti check the same data again (even if it is filtered or substr is calculated, etc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants