-
-
Notifications
You must be signed in to change notification settings - Fork 989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AK: Replace Unicode validation, conversion, and length computation with simdutf #674
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
trflynn89
force-pushed
the
utf
branch
2 times, most recently
from
July 16, 2024 21:01
73f5917
to
001cb73
Compare
The ASAN error is in this test-js test in builtins/Array/Array.prototype.flat.js test("Issue #9317, stack overflow in flatten_into_array from flat call", () => {
var a = [];
a[0] = a;
expect(() => {
a.flat(3893232121);
}).toThrowWithMessage(InternalError, "Call stack size limit exceeded");
}); We seem to be hitting ASAN's stack overflow just before that test reaches the VM's stack limit (we would throw the InternalError once we reach 32k free, we have 33.5 free when ASAN aborts). |
ADKaster
reviewed
Jul 17, 2024
trflynn89
force-pushed
the
utf
branch
3 times, most recently
from
July 17, 2024 17:48
45ae49e
to
5e768e8
Compare
AK will depend on some vcpkg dependencies, so the Lagom tools build will need to know how to use vcpkg. We can do this by sym-linking vcpkg.json to Meta/Lagom (as vcpkg.json has to be in the CMake source directory). We also need a CMakePresets.json in the source directory, which can just include the root file. The root CMakePresets then needs to define paths relative to ${fileDir} rather than ${sourceDir}.
The one behavior difference is that we will now actually fail on invalid code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was arguably a bug that this wasn't already the case.
nico
pushed a commit
to nico/serenity
that referenced
this pull request
Oct 17, 2024
Currently, invoking StringBuilder::to_string will re-allocate the string data to construct the String. This is wasteful both in terms of memory and speed. The goal here is to simply hand the string buffer over to String, and let String take ownership of that buffer. To do this, StringBuilder must have the same memory layout as Detail::StringData. This layout is just the members of the StringData class followed by the string itself. So when a StringBuilder is created, we reserve sizeof(StringData) bytes at the front of the buffer. StringData can then construct itself into the buffer with placement new. Things to note: * StringData must now be aware of the actual capacity of its buffer, as that can be larger than the string size. * We must take care not to pass ownership of inlined string buffers, as these live on the stack. (cherry picked from commit 29879a69a4b2eda4f0315027cb1e86964d333221; amended minor conflict in AK/String.h due to us not having String::from_utf16() from LadybirdBrowser/ladybird#674, last commit)
Hendiadyoin1
pushed a commit
to Hendiadyoin1/serenity
that referenced
this pull request
Nov 9, 2024
Currently, invoking StringBuilder::to_string will re-allocate the string data to construct the String. This is wasteful both in terms of memory and speed. The goal here is to simply hand the string buffer over to String, and let String take ownership of that buffer. To do this, StringBuilder must have the same memory layout as Detail::StringData. This layout is just the members of the StringData class followed by the string itself. So when a StringBuilder is created, we reserve sizeof(StringData) bytes at the front of the buffer. StringData can then construct itself into the buffer with placement new. Things to note: * StringData must now be aware of the actual capacity of its buffer, as that can be larger than the string size. * We must take care not to pass ownership of inlined string buffers, as these live on the stack. (cherry picked from commit 29879a69a4b2eda4f0315027cb1e86964d333221; amended minor conflict in AK/String.h due to us not having String::from_utf16() from LadybirdBrowser/ladybird#674, last commit)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We brought simdutf in with base64 transcodings. It also provides a wide range of Unicode utilities that we can easily use, and they are much more performant than our implementations.
There is still room on the table for more performance improvements, marked in the code as FIXMEs for now. The main reason these aren't as fast as they can be is that our Unicode views inject U+FFFD as a replacement character on invalid code points. So we have to check for validity and handle that case ourselves, as simdutf largely only works with valid encodings.
UTF-8 Benchmark
Benchmark code
UTF-16 Benchmark
Benchmark code