Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support using simdutf8 for validate_string_view and other utf8 validation #7014

Open
Dandandan opened this issue Jan 24, 2025 · 7 comments
Open
Labels
enhancement Any new improvement worthy of a entry in the changelog performance

Comments

@Dandandan
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We have some different sources were std::str::from_utf8 is used for utf8 validation, such as validate_string_view.
We can speed it up by supporting simdutf8 as well here.

Describe the solution you'd like

Similar for #6668 , optionally switch to using simdutf8 instead.

Describe alternatives you've considered

Additional context

@etseidl
Copy link
Contributor

etseidl commented Jan 24, 2025

I mentioned this in #6668, but I was thinking we could just globally replace calls to from_utf8 with an internal version. Something like

#[cfg(feature = "simdutf8")]
#[inline(always)]
pub fn from_utf8(val: &[u8]) -> Result<&str, simdutf8::compat::Utf8Error> {
    match simdutf8::basic::from_utf8(val) {
        Ok(result) => Ok(result),
        Err(_) => simdutf8::compat::from_utf8(val),
    }
}

#[cfg(not(feature = "simdutf8"))]
#[inline(always)]
pub fn from_utf8(val: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(val)
}

@alamb
Copy link
Contributor

alamb commented Jan 24, 2025

I am embarassed to admit I missed the StringView one -- I will make a PR to switch

@alamb
Copy link
Contributor

alamb commented Jan 24, 2025

Update I don't think I was going crazy. I can't find any remaining hot location that does utf8 validation that I think will make any different for performance

To be clear, I don't think @etseidl 's idea is a bad one, just I don't think it will help much

@etseidl
Copy link
Contributor

etseidl commented Jan 24, 2025

I don't think it will help much

That is entirely possible. I'm still testing to see if I have any use cases that would benefit.

@alamb
Copy link
Contributor

alamb commented Jan 24, 2025

Maybe I should have used the positive rather than double negative: " @etseidl 's idea is a good one, I just don't think it will help much " 🤦

@Dandandan
Copy link
Contributor Author

Dandandan commented Jan 30, 2025

There are places that still do std::str::from_utf8 like validate_string_view.

pub fn validate_string_view(views: &[u128], buffers: &[Buffer]) -> Result<(), ArrowError> {

This still has std::str::from_utf8 ?
I got it from @alamb profile in apache/datafusion#13983 (comment) where it has ~5% of samples, so I expect that query to benefit from it by a couple of percentage points.

I think there is also some other places which can benefit by a bit from this (e.g. reading csv/json, casting...).
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20std%3A%3Astr%3A%3Afrom_utf8&type=code

@Dandandan
Copy link
Contributor Author

So I think we can follow the idea of @etseidl to replace all std::str::from_utf8 calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog performance
Projects
None yet
Development

No branches or pull requests

3 participants