AK: Replace Unicode validation, conversion, and length computation with simdutf #674

trflynn89 · 2024-07-16T20:52:18Z

We originally brought in simdutf for base64 transcoding. It also provides a wide range of Unicode utilities that we can easily use, and they are much faster than our own implementations.

There is still room for further performance improvements, marked in the code as FIXMEs for now. The main reason these paths aren't as fast as they could be is that our Unicode views inject U+FFFD as a replacement character for invalid code points. So we have to check for validity and handle that case ourselves, as simdutf largely only works with valid encodings.
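To illustrate the validity concern, here is a minimal sketch of a "validate first, then take the fast path" length computation. It is not the code from this PR: `sequence_length`, `is_valid_utf8`, and `code_point_length` are hypothetical names, the validator is a scalar stand-in for a SIMD routine such as the one simdutf provides, and the slow path assumes that each invalid byte counts as one U+FFFD, which may not match how AK's views group invalid bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: length in bytes of the well-formed UTF-8 sequence
// starting at s[i], or 0 if the bytes there are not well-formed.
static size_t sequence_length(std::string_view s, size_t i)
{
    assert(i < s.size());
    auto byte = [&](size_t k) { return static_cast<uint8_t>(s[k]); };

    uint8_t b0 = byte(i);
    if (b0 < 0x80)
        return 1;

    size_t n = 0;
    uint32_t cp = 0;
    uint32_t min = 0;
    if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0; // stray continuation byte or invalid lead byte

    if (i + n > s.size())
        return 0; // truncated sequence
    for (size_t k = 1; k < n; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    // Reject overlong encodings, out-of-range values, and surrogates.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return n;
}

// Scalar stand-in for a SIMD validator.
static bool is_valid_utf8(std::string_view s)
{
    for (size_t i = 0; i < s.size();) {
        size_t n = sequence_length(s, i);
        if (n == 0)
            return false;
        i += n;
    }
    return true;
}

static size_t code_point_length(std::string_view s)
{
    if (is_valid_utf8(s)) {
        // Fast path: in valid UTF-8, every non-continuation byte starts a code point.
        size_t count = 0;
        for (unsigned char c : s)
            count += (c & 0xC0) != 0x80;
        return count;
    }

    // Slow path (assumption): each invalid byte is counted as one U+FFFD.
    size_t count = 0;
    for (size_t i = 0; i < s.size();) {
        size_t n = sequence_length(s, i);
        i += (n == 0) ? 1 : n;
        ++count;
    }
    return count;
}
```

In this sketch the validation pass and the fast counting loop are the parts a SIMD library can accelerate; the slow path only runs when the input is malformed.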

UTF-8 Benchmark

Benchmark code

static auto string = String::repeated("abc😀def"_string, 1'000'000).release_value();

BENCHMARK_CASE(bench_validate)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT(string.code_points().validate());
}

BENCHMARK_CASE(bench_length)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT_EQ(string.code_points().length(), 7'000'000u);
}

Implementation	Validation	Length
AK	9.145s	4.698s
simdutf	0.460s	0.582s
Speedup	20x faster	8x faster

UTF-16 Benchmark

Benchmark code

static Vector<u32> to_utf32(String const& string)
{
    Vector<u32> code_points;
    for (auto code_point : string.code_points())
        code_points.append(code_point);
    return code_points;
}

static auto utf8_string = String::repeated("abc😀def"_string, 1'000'000).release_value();
static auto utf16_string = AK::utf8_to_utf16(utf8_string).release_value();
static auto utf32_string = to_utf32(utf8_string);

BENCHMARK_CASE(bench_validate)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT(Utf16View { utf16_string }.validate());
}

BENCHMARK_CASE(bench_length)
{
    for (size_t i = 0; i < 1'000; ++i)
        EXPECT_EQ(Utf16View { utf16_string }.length_in_code_points(), 7'000'000u);
}

BENCHMARK_CASE(bench_utf16_to_utf8)
{
    for (size_t i = 0; i < 100; ++i)
        MUST(Utf16View { utf16_string }.to_utf8());
}

BENCHMARK_CASE(bench_utf8_to_utf16)
{
    for (size_t i = 0; i < 100; ++i)
        MUST(AK::utf8_to_utf16(utf8_string));
}

BENCHMARK_CASE(bench_utf32_to_utf16)
{
    for (size_t i = 0; i < 100; ++i)
        MUST(AK::utf32_to_utf16({ utf32_string.data(), utf32_string.size() }));
}

Implementation	Validation	Length	To UTF-8	From UTF-8	From UTF-32
AK	3.549s	8.624s	5.323s	3.899s	2.158s
simdutf	0.263s	0.576s	0.550s	0.663s	0.594s
Speedup	13x faster	15x faster	10x faster	6x faster	4x faster

trflynn89 · 2024-07-17T01:16:28Z

The ASAN error is in this test-js test in builtins/Array/Array.prototype.flat.js:

test("Issue #9317, stack overflow in flatten_into_array from flat call", () => {
    var a = [];
    a[0] = a;
    expect(() => {
        a.flat(3893232121);
    }).toThrowWithMessage(InternalError, "Call stack size limit exceeded");
});

We seem to be hitting ASAN's stack overflow just before that test reaches the VM's stack limit: we would throw the InternalError once we reach 32k free, but we still have 33.5k free when ASAN aborts.

AK will depend on some vcpkg dependencies, so the Lagom tools build will need to know how to use vcpkg. We can do this by symlinking vcpkg.json into Meta/Lagom (as vcpkg.json has to be in the CMake source directory). We also need a CMakePresets.json in the source directory, which can just include the root file. The root CMakePresets.json then needs to define paths relative to ${fileDir} rather than ${sourceDir}.

The one behavior difference is that we will now actually fail on invalid code units with Utf16View::to_utf8(AllowInvalidCodeUnits::No). It was arguably a bug that this wasn't already the case.
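As a rough illustration of what failing on invalid code units means, here is a minimal sketch of a strict UTF-16 to UTF-8 conversion that rejects unpaired surrogates. It is not the AK or simdutf implementation: `append_utf8` and `utf16_to_utf8_strict` are hypothetical names, and returning `std::optional` stands in for whatever error type the real API uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: append one code point to `out` as UTF-8.
static void append_utf8(std::string& out, uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Strict conversion: returns std::nullopt if any surrogate is unpaired.
static std::optional<std::string> utf16_to_utf8_strict(std::u16string_view input)
{
    std::string out;
    for (size_t i = 0; i < input.size(); ++i) {
        uint32_t unit = input[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= input.size())
                return std::nullopt;
            uint32_t low = input[i + 1];
            if (low < 0xDC00 || low > 0xDFFF)
                return std::nullopt;
            append_utf8(out, 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i; // consume the low surrogate as well
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // A low surrogate without a preceding high surrogate is invalid.
            return std::nullopt;
        } else {
            append_utf8(out, unit);
        }
    }
    return out;
}
```

A caller that wants the permissive behavior would need a separate code path; this sketch deliberately leaves that out rather than guess at how invalid code units are encoded there.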

Currently, invoking StringBuilder::to_string will re-allocate the string data to construct the String. This is wasteful both in terms of memory and speed.

The goal here is to simply hand the string buffer over to String, and let String take ownership of that buffer. To do this, StringBuilder must have the same memory layout as Detail::StringData. This layout is just the members of the StringData class followed by the string itself.

So when a StringBuilder is created, we reserve sizeof(StringData) bytes at the front of the buffer. StringData can then construct itself into the buffer with placement new.

Things to note:
* StringData must now be aware of the actual capacity of its buffer, as that can be larger than the string size.
* We must take care not to pass ownership of inlined string buffers, as these live on the stack.

(cherry picked from commit 29879a69a4b2eda4f0315027cb1e86964d333221; amended minor conflict in AK/String.h due to us not having String::from_utf16() from LadybirdBrowser/ladybird#674, last commit)
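The buffer hand-off described in that commit message can be sketched roughly as follows. This is a simplified illustration, not the AK code: `Header` is a hypothetical stand-in for Detail::StringData, `Builder` for StringBuilder, and the sketch always allocates on the heap, so it sidesteps the inline-buffer caveat entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <string_view>

// Hypothetical stand-in for Detail::StringData: a header followed
// immediately in memory by the string bytes.
struct Header {
    size_t byte_count;
    size_t capacity; // may be larger than byte_count

    char const* bytes() const { return reinterpret_cast<char const*>(this + 1); }
    std::string_view view() const { return { bytes(), byte_count }; }
};

// Hypothetical stand-in for StringBuilder.
class Builder {
public:
    Builder() = default;
    Builder(Builder const&) = delete;
    Builder& operator=(Builder const&) = delete;
    ~Builder() { std::free(m_buffer); }

    void append(std::string_view s)
    {
        if (s.empty())
            return;
        ensure_capacity(m_size + s.size());
        // String bytes always start after the reserved header space.
        std::memcpy(m_buffer + sizeof(Header) + m_size, s.data(), s.size());
        m_size += s.size();
    }

    // Hands the buffer over without copying the string bytes.
    // The caller owns the result and frees it with std::free().
    Header* release()
    {
        ensure_capacity(m_size); // guarantees a heap buffer even when empty
        // Construct the header in the space reserved at the front.
        auto* header = new (m_buffer) Header { m_size, m_capacity };
        m_buffer = nullptr;
        m_size = 0;
        m_capacity = 0;
        return header;
    }

private:
    void ensure_capacity(size_t needed)
    {
        if (m_buffer && needed <= m_capacity)
            return;
        size_t new_capacity = needed > m_capacity * 2 ? needed : m_capacity * 2;
        auto* p = static_cast<unsigned char*>(
            std::realloc(m_buffer, sizeof(Header) + new_capacity));
        assert(p);
        m_buffer = p;
        m_capacity = new_capacity;
    }

    unsigned char* m_buffer { nullptr };
    size_t m_size { 0 };     // string bytes written so far
    size_t m_capacity { 0 }; // string bytes available after the header
};
```

Because the header records the capacity separately from the byte count, the new owner can still tell how large the allocation is even though the builder over-allocated while growing.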

trflynn89 force-pushed the utf branch 2 times, most recently from 73f5917 to 001cb73 on July 16, 2024 21:01

ADKaster reviewed Jul 17, 2024

AK/Utf8View.cpp (outdated, resolved)

AK/Utf16View.cpp (resolved)

trflynn89 force-pushed the utf branch 3 times, most recently from 45ae49e to 5e768e8 on July 17, 2024 17:48

trflynn89 added 5 commits July 17, 2024 14:53

AK: Remove Lagom tools workaround for simdutf

cace0cd

AK: Replace UTF-8 validation and length computation with simdutf

fe09f7d

AK: Replace UTF-16 validation and length computation with simdutf

e45bc64

AK: Replace converting to and from UTF-16 with simdutf

e2b7b0b

trflynn89 force-pushed the utf branch from 5e768e8 to e2b7b0b on July 17, 2024 18:54

awesomekling merged commit 0c14a94 into LadybirdBrowser:master Jul 18, 2024
6 checks passed

trflynn89 deleted the utf branch July 18, 2024 12:47

trflynn89 mentioned this pull request Jul 18, 2024

AK+LibTextCodec: Use AK facilities to validate and convert UTF-16 to UTF-8 #698

Merged
