Epic: enhance the substring kernel #1531
Comments
Thanks @HaoYang670 I suggest:
|
Hi @alamb. After thinking twice about the renaming, I suggest we leave the
In this way, backward compatibility will not be broken. |
Thank you for your thoughts @HaoYang670 I still think switching
Perhaps other maintainers such as @nevi-me @viirya @sunchao, @tustvold and @jhorstmann have some thoughts about if/how we should change |
Coming at the problem from a Rust perspective, as opposed to potentially a query engine perspective, I personally would expect it to behave similarly to the standard library. That is:
That being said I don't really understand the use-cases of
With regards to the checked bytes API, whatever it is called, there is a bit-twiddling trick you can do to make it relatively cheap - see here. Assuming the contents of the string before were valid UTF-8, it is sufficient to just verify you aren't splitting a codepoint - you do not need to revalidate the entire array. |
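The boundary check described in this comment can be sketched with the standard library alone. This is a minimal illustration, not arrow-rs code: `is_char_boundary_byte` and `is_valid_cut` are hypothetical names, and the bit trick shown is assumed to be the same idea Rust's standard library uses for `str::is_char_boundary` (a byte starts a code point unless it is a continuation byte of the form `0b10xx_xxxx`).

```rust
/// Hypothetical helper: returns true if `b` is the first byte of a UTF-8 code point.
/// Continuation bytes are 0x80..=0xBF; reinterpreted as i8 they are -128..=-65,
/// so any byte that is >= -64 (-0x40) as an i8 starts a code point.
fn is_char_boundary_byte(b: u8) -> bool {
    (b as i8) >= -0x40
}

/// Hypothetical helper: checks that cutting `data` at `offset` does not split a
/// code point. Assumes `data` was valid UTF-8 to begin with, so only the single
/// byte at the cut position needs to be inspected.
fn is_valid_cut(data: &[u8], offset: usize) -> bool {
    match data.get(offset) {
        Some(&b) => is_char_boundary_byte(b),
        // Cutting exactly at the end is fine; anything past the end is not.
        None => offset == data.len(),
    }
}

fn main() {
    let s = "a🗻b"; // 1 + 4 + 1 bytes
    let bytes = s.as_bytes();
    for offset in 0..=bytes.len() {
        // The one-byte check agrees with the standard library at every offset.
        assert_eq!(is_valid_cut(bytes, offset), s.is_char_boundary(offset));
    }
    println!("all offsets agree with str::is_char_boundary");
}
```

Because each start and end offset costs one comparison, validating a sliced array this way scales with the number of offsets rather than with the total number of bytes.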
I think @tustvold 's perspective matches @HaoYang670 's proposal #1531 (comment) . I am convinced by @HaoYang670 's proposal |
This is a fantastic point! Maybe the cost of checking utf-8 validation can be so cheap that we don't need the |
So I think the first step is to add the utf-8 validation checking into the current |
Hmm, @tustvold and @HaoYang670 's proposal looks good. I think it is a good idea to follow the standard library.
About the proposal, my question is, do we really need two APIs for |
Hi @viirya. In general: That's why we need 3 versions of |
Thanks @HaoYang670 . It is clear from the above list, |
We can design a situation where Suppose we want to get the substrings with the byte length of the shortest string in the array. We could do it this way by using the

    substring(string_array, start = 0, length = Some(min(length(string_array))))

Then, given an input

    let input = [
        "14 bytes 🗻",
        "This string has 24 bytes",
        "🗻 13 bytes"
    ];

We will get an invalid string array if using |
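The failure mode in this scenario can be reproduced with plain Rust strings. The sketch below is an assumption-laden illustration rather than the arrow-rs kernel: `checked_byte_prefix` is a hypothetical helper, and the input strings are chosen here so that their byte lengths are easy to verify. It takes a prefix of each string whose length is the byte length of the shortest string, and uses `str::get` so that a cut inside a code point is reported as `None` instead of producing invalid UTF-8.

```rust
/// Hypothetical helper: a byte-based prefix that refuses to split a code point.
/// `str::get` returns `None` when the range does not fall on char boundaries.
fn checked_byte_prefix(s: &str, len: usize) -> Option<&str> {
    s.get(..len.min(s.len()))
}

fn main() {
    // Illustrative input (not the strings from the thread): 6, 5 and 6 bytes.
    let input = ["ab🗻", "abcde", "🗻cd"];

    // Byte length of the shortest string in the array.
    let min_len = input.iter().map(|s| s.len()).min().unwrap();

    for s in input.iter() {
        // For "ab🗻" the cut at `min_len` bytes lands inside the 4-byte emoji,
        // so the checked prefix is `None`.
        println!("{:?} -> {:?}", s, checked_byte_prefix(s, min_len));
    }
}
```

An unchecked byte-based slice would return the partial bytes for the first string, which is exactly the invalid string array the comment warns about.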
We don't need to implement the |
This is an epic issue to enhance the substring kernel in terms of performance, safety, and supported types. You can find more info in the discussion: #1529 (comment)
List of individual tasks is below:
Performance
substring kernel by about 2x #1512
substring_by_char kernel #1800
Safety
substring kernel #1529
substring #1577
substring kernel): The null buffer is not aligned when offset != 0 #1639
New Features
substring support for binary #1608
substring support for FixedSizeBinaryArray #1618
substring_by_char kernels for slicing on character boundaries #1768
substring kernel should support FixedSizeListArray #1887