AK: Replace Unicode validation, conversion, and length computation with simdutf #674

Merged: 5 commits, Jul 18, 2024

@trflynn89 commented Jul 16, 2024

We originally brought simdutf in for base64 transcoding. It also provides a wide range of Unicode utilities that we can easily adopt, and they are much faster than our own implementations.

There is still performance left on the table, marked in the code with FIXMEs for now. The main reason these paths aren't as fast as they could be is that our Unicode views substitute U+FFFD (the replacement character) for invalid code points. simdutf largely only works with valid encodings, so we have to check for validity and handle the invalid case ourselves.
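To illustrate the validity check described above, here is a minimal scalar sketch of a "validate first, then take the fast path" length computation. It does not use simdutf or AK; every function name is hypothetical. The policy of counting each invalid byte as one U+FFFD is an assumption made for the example and may not match what AK's views actually do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: length of the well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes at that position are not well-formed.
static std::size_t sequence_length(std::string_view s, std::size_t i)
{
    assert(i < s.size());
    auto byte = [&](std::size_t k) { return static_cast<std::uint8_t>(s[k]); };
    auto is_continuation = [&](std::size_t k) { return k < s.size() && (byte(k) & 0xC0) == 0x80; };

    std::uint8_t b0 = byte(i);
    if (b0 < 0x80)
        return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return is_continuation(i + 1) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {
        if (!is_continuation(i + 1) || !is_continuation(i + 2))
            return 0;
        std::uint8_t b1 = byte(i + 1);
        if (b0 == 0xE0 && b1 < 0xA0)
            return 0; // overlong encoding
        if (b0 == 0xED && b1 > 0x9F)
            return 0; // encodes a UTF-16 surrogate
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {
        if (!is_continuation(i + 1) || !is_continuation(i + 2) || !is_continuation(i + 3))
            return 0;
        std::uint8_t b1 = byte(i + 1);
        if (b0 == 0xF0 && b1 < 0x90)
            return 0; // overlong encoding
        if (b0 == 0xF4 && b1 > 0x8F)
            return 0; // above U+10FFFF
        return 4;
    }
    return 0; // stray continuation byte or invalid lead byte
}

bool is_valid_utf8(std::string_view s)
{
    for (std::size_t i = 0; i < s.size();) {
        std::size_t n = sequence_length(s, i);
        if (n == 0)
            return false;
        i += n;
    }
    return true;
}

// Fast path, only correct for valid input: every code point contributes
// exactly one byte that is not a continuation byte.
std::size_t count_code_points_valid(std::string_view s)
{
    std::size_t count = 0;
    for (unsigned char b : s)
        count += (b & 0xC0) != 0x80;
    return count;
}

// Slow path (assumed policy): each invalid byte counts as one U+FFFD.
std::size_t count_code_points_with_replacement(std::string_view s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        std::size_t n = sequence_length(s, i);
        i += (n == 0) ? 1 : n;
    }
    return count;
}

std::size_t length_in_code_points(std::string_view s)
{
    return is_valid_utf8(s) ? count_code_points_valid(s) : count_code_points_with_replacement(s);
}
```

In this sketch the validation pass and the fast counting loop are the parts a vectorized library could take over; the replacement-aware loop is the part that stays scalar.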

UTF-8 Benchmark

Benchmark code

static auto string = String::repeated("abc😀def"_string, 1'000'000).release_value();

BENCHMARK_CASE(bench_validate)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT(string.code_points().validate());
}

BENCHMARK_CASE(bench_length)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT_EQ(string.code_points().length(), 7'000'000u);
}

| | Validation | Length |
|---|---|---|
| AK | 9.145s | 4.698s |
| simdutf | 0.460s | 0.582s |
| | 20x faster | 8x faster |

UTF-16 Benchmark

Benchmark code

static Vector<u32> to_utf32(String const& string)
{
    Vector<u32> code_points;
    for (auto code_point : string.code_points())
        code_points.append(code_point);
    return code_points;
}

static auto utf8_string = String::repeated("abc😀def"_string, 1'000'000).release_value();
static auto utf16_string = AK::utf8_to_utf16(utf8_string).release_value();
static auto utf32_string = to_utf32(utf8_string);

BENCHMARK_CASE(bench_validate)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT(Utf16View { utf16_string }.validate());
}

BENCHMARK_CASE(bench_length)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT_EQ(Utf16View { utf16_string }.length_in_code_points(), 7'000'000u);
}

BENCHMARK_CASE(bench_utf16_to_utf8)
{
    for (size_t i = 0; i < 100; ++i)
        MUST(Utf16View { utf16_string }.to_utf8());
}

BENCHMARK_CASE(bench_utf8_to_utf16)
{
    for (size_t i = 0; i < 100; ++i)
        MUST(AK::utf8_to_utf16(utf8_string));
}

BENCHMARK_CASE(bench_utf32_to_utf16)
{
    for (size_t i = 0; i < 100; ++i)
        MUST(AK::utf32_to_utf16({ utf32_string.data(), utf32_string.size() }));
}

| | Validation | Length | To UTF-8 | From UTF-8 | From UTF-32 |
|---|---|---|---|---|---|
| AK | 3.549s | 8.624s | 5.323s | 3.899s | 2.158s |
| simdutf | 0.263s | 0.576s | 0.550s | 0.663s | 0.594s |
| | 13x faster | 15x faster | 10x faster | 6x faster | 4x faster |

@trflynn89

The ASAN error is in this test-js test, in builtins/Array/Array.prototype.flat.js:

test("Issue #9317, stack overflow in flatten_into_array from flat call", () => {
    var a = [];
    a[0] = a;
    expect(() => {
        a.flat(3893232121);
    }).toThrowWithMessage(InternalError, "Call stack size limit exceeded");
});

We seem to be hitting ASAN's stack overflow just before that test reaches the VM's stack limit: we would throw the InternalError once we are down to 32k of free stack, but we still have 33.5k free when ASAN aborts.

AK will depend on some vcpkg dependencies, so the Lagom tools build will
need to know how to use vcpkg. We can do this by sym-linking vcpkg.json
to Meta/Lagom (as vcpkg.json has to be in the CMake source directory).
We also need a CMakePresets.json in the source directory, which can just
include the root file. The root CMakePresets then needs to define paths
relative to ${fileDir} rather than ${sourceDir}.

The one behavior difference is that we will now actually fail on invalid
code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was
arguably a bug that this wasn't already the case.
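The strict behavior described in that commit message can be sketched as follows. This is a standalone scalar illustration, not AK's or simdutf's code: the function names and the `Yes` enumerator are assumptions, and substituting U+FFFD in the lenient mode is a choice made for the example that may differ from what `Utf16View::to_utf8` does.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Mirrors the name used in the commit message; the Yes enumerator is assumed.
enum class AllowInvalidCodeUnits { No, Yes };

// Hypothetical helper: append one Unicode scalar value as UTF-8.
static void append_utf8(std::string& output, char32_t code_point)
{
    assert(code_point <= 0x10FFFF);
    if (code_point < 0x80) {
        output += static_cast<char>(code_point);
    } else if (code_point < 0x800) {
        output += static_cast<char>(0xC0 | (code_point >> 6));
        output += static_cast<char>(0x80 | (code_point & 0x3F));
    } else if (code_point < 0x10000) {
        output += static_cast<char>(0xE0 | (code_point >> 12));
        output += static_cast<char>(0x80 | ((code_point >> 6) & 0x3F));
        output += static_cast<char>(0x80 | (code_point & 0x3F));
    } else {
        output += static_cast<char>(0xF0 | (code_point >> 18));
        output += static_cast<char>(0x80 | ((code_point >> 12) & 0x3F));
        output += static_cast<char>(0x80 | ((code_point >> 6) & 0x3F));
        output += static_cast<char>(0x80 | (code_point & 0x3F));
    }
}

// Returns std::nullopt on an unpaired surrogate when invalid code units are
// not allowed; otherwise (assumed policy) substitutes U+FFFD for it.
std::optional<std::string> utf16_to_utf8(std::u16string_view input, AllowInvalidCodeUnits allow)
{
    std::string output;
    for (std::size_t i = 0; i < input.size(); ++i) {
        char32_t code_point = input[i];
        bool is_high = code_point >= 0xD800 && code_point <= 0xDBFF;
        bool is_low = code_point >= 0xDC00 && code_point <= 0xDFFF;
        bool next_is_low = i + 1 < input.size() && input[i + 1] >= 0xDC00 && input[i + 1] <= 0xDFFF;

        if (is_high && next_is_low) {
            // Combine a valid surrogate pair into one supplementary code point.
            code_point = 0x10000 + ((code_point - 0xD800) << 10) + (input[i + 1] - 0xDC00);
            ++i;
        } else if (is_high || is_low) {
            if (allow == AllowInvalidCodeUnits::No)
                return std::nullopt;
            code_point = 0xFFFD;
        }
        append_utf8(output, code_point);
    }
    return output;
}
```

With this shape, callers that pass `AllowInvalidCodeUnits::No` have to handle the failure case explicitly instead of silently receiving converted output.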
@awesomekling awesomekling merged commit 0c14a94 into LadybirdBrowser:master Jul 18, 2024
6 checks passed
@trflynn89 trflynn89 deleted the utf branch July 18, 2024 12:47
nico pushed a commit to nico/serenity that referenced this pull request Oct 17, 2024
Currently, invoking StringBuilder::to_string will re-allocate the string
data to construct the String. This is wasteful both in terms of memory
and speed.

The goal here is to simply hand the string buffer over to String, and
let String take ownership of that buffer. To do this, StringBuilder must
have the same memory layout as Detail::StringData. This layout is just
the members of the StringData class followed by the string itself.

So when a StringBuilder is created, we reserve sizeof(StringData) bytes
at the front of the buffer. StringData can then construct itself into
the buffer with placement new.

Things to note:
* StringData must now be aware of the actual capacity of its buffer, as
  that can be larger than the string size.
* We must take care not to pass ownership of inlined string buffers, as
  these live on the stack.

(cherry picked from commit 29879a69a4b2eda4f0315027cb1e86964d333221;
amended minor conflict in AK/String.h due to us not having
String::from_utf16() from LadybirdBrowser/ladybird#674, last commit)
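The buffer handoff described in this commit message can be illustrated with a small standalone sketch. `Header` and `Builder` are hypothetical stand-ins for `Detail::StringData` and `StringBuilder`, not the real classes. The sketch always heap-allocates, so it sidesteps the inline-buffer caveat from the list above rather than handling it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <string_view>

// Hypothetical stand-in for Detail::StringData: a header that is followed
// directly by the character data in the same allocation.
struct Header {
    std::size_t byte_count;
    std::size_t capacity; // may be larger than byte_count

    char* characters() { return reinterpret_cast<char*>(this) + sizeof(Header); }
};

// Hypothetical stand-in for StringBuilder.
class Builder {
public:
    Builder() = default;
    Builder(Builder const&) = delete;
    Builder& operator=(Builder const&) = delete;
    ~Builder() { std::free(m_buffer); }

    void append(std::string_view text)
    {
        if (text.empty())
            return;
        ensure_capacity(m_size + text.size());
        // Characters are written after the space reserved for the header.
        std::memcpy(m_buffer + sizeof(Header) + m_size, text.data(), text.size());
        m_size += text.size();
    }

    // Hands the buffer over without copying the characters. The caller owns
    // the returned allocation and releases it with std::free().
    Header* release()
    {
        ensure_capacity(m_size); // guarantees a buffer exists even when empty
        assert(m_buffer != nullptr);
        Header* header = new (m_buffer) Header { m_size, m_capacity };
        m_buffer = nullptr;
        m_size = 0;
        m_capacity = 0;
        return header;
    }

private:
    void ensure_capacity(std::size_t needed)
    {
        if (m_buffer && needed <= m_capacity)
            return;
        std::size_t new_capacity = m_capacity == 0 ? 16 : m_capacity * 2;
        if (new_capacity < needed)
            new_capacity = needed;
        // Reserve sizeof(Header) bytes at the front of every allocation.
        void* buffer = std::realloc(m_buffer, sizeof(Header) + new_capacity);
        if (!buffer)
            throw std::bad_alloc();
        m_buffer = static_cast<unsigned char*>(buffer);
        m_capacity = new_capacity;
    }

    unsigned char* m_buffer { nullptr };
    std::size_t m_size { 0 };
    std::size_t m_capacity { 0 };
};
```

Because the header records the capacity separately from the byte count, the new owner knows the allocation may be larger than the string it holds, which is the first point in the list above.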
Hendiadyoin1 pushed a commit to Hendiadyoin1/serenity that referenced this pull request Nov 9, 2024