This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Optimize GC.AllocateUninitializedArray and use it in StringBuilder #27364

Merged


adamsitnik
Member

I wanted to use GC.AllocateUninitializedArray in StringBuilder, but it was initially too slow: calling it for small buffers caused a noticeable performance regression.

I've tuned it so that it does not slow StringBuilder down on the "unlucky path" (small arrays) and does improve performance on the "lucky path" (big arrays). This should also make the API more profitable to use in other places in the future.

Changes:

  • Remove one branch by treating the length as uint instead of int (this drops the separate if (length < 0) check). This changes the behavior of this internal API for length < 0: previously the caller would get IndexOutOfRangeException. Since this is an internal API, I hope it's OK.
  • Increase the threshold from 256 to 2048 bytes; please see the results below.
  • Force inlining of GC.AllocateUninitializedArray and move the expensive native call to a separate method so that the inlined code does not grow too much.
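
The three changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the class name `UninitializedArraySketch`, the method names `Allocate` and `AllocateSlow`, and the constant `ThresholdBytes` are hypothetical, and the slow path calls the public `GC.AllocateUninitializedArray<T>` overload that only became available in .NET 5, standing in for the internal native call discussed here.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class UninitializedArraySketch
{
    // Hypothetical constant mirroring the 2048-byte threshold proposed in this PR.
    private const int ThresholdBytes = 2048;

    // Small, inlinable fast path: the checks are cheap and the expensive call lives elsewhere.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T[] Allocate<T>(int length)
    {
        // Arrays that hold references must always be zeroed, so use the normal allocator.
        if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
            return new T[length];

        // One unsigned comparison: a negative length turns into a very large uint,
        // fails this test, and is rejected later by the allocator on the slow path.
        if ((uint)length < (uint)(ThresholdBytes / Unsafe.SizeOf<T>()))
            return new T[length];

        return AllocateSlow<T>(length);
    }

    // Kept out of line so that inlining Allocate does not bloat every call site.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static T[] AllocateSlow<T>(int length)
        => GC.AllocateUninitializedArray<T>(length); // assumption: public API, .NET 5 or later
}
```

A caller would use `UninitializedArraySketch.Allocate<char>(4096)` exactly like `new char[4096]`, with the difference that the contents of large arrays must be treated as garbage until written.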

Micro benchmarks for the GC API:

[GenericTypeArguments(typeof(byte))]
[GenericTypeArguments(typeof(char))]
[GenericTypeArguments(typeof(object))]
public class Perf_GC<T>
{
    private readonly Func<int, T[]> _allocateUninitializedArrayDelegate = CreateDelegate<int>(typeof(GC), "AllocateUninitializedArray");
    private readonly Func<int, T[]> _allocateArrayDelegate = CreateDelegate<int>(typeof(Mimic), "AllocateArray");

    [Params(256, 256 * 2, 256 * 3, 256 * 4, 256 * 6, 256 * 8)]
    public int Length;

    [Benchmark]
    public T[] AllocateUninitializedArray() => _allocateUninitializedArrayDelegate(Length);

    [Benchmark]
    public T[] AllocateArray() => _allocateArrayDelegate(Length); // using delegate for apples to apples comparison

    private static Func<N, T[]> CreateDelegate<N>(Type type, string methodName)
    {
        // this method is not a part of .NET Standard so we need to use reflection
        var method = type
            .GetMethod(methodName, BindingFlags.NonPublic | BindingFlags.Static)
            ?.MakeGenericMethod(typeof(T)); // GetMethod returns null if the method is missing, so guard before calling MakeGenericMethod

        return method != null ? (Func<N, T[]>)method.CreateDelegate(typeof(Func<N, T[]>)) : null;
    }
}

public static class Mimic
{
    internal static T[] AllocateArray<T>(int size) => new T[size];
}

I've simplified the default BenchmarkDotNet (BDN) output to make the results easier to compare. In the table below, "Before" is the execution time of GC.AllocateUninitializedArray before my changes and "After" is the time with my changes. The "new T[]" column is the time for calling the new operator, included as a baseline.

| Type   | Length | Before    | After     | new T[]   |
|--------|--------|-----------|-----------|-----------|
| Byte   | 256    | 78.63 ns  | 18.31 ns  | 18.17 ns  |
| Char   | 256    | 79.33 ns  | 31.95 ns  | 31.66 ns  |
| Object | 256    | 113.34 ns | 113.34 ns | 113.03 ns |
| Byte   | 512    | 79.37 ns  | 31.38 ns  | 31.75 ns  |
| Char   | 512    | 87.60 ns  | 58.12 ns  | 57.71 ns  |
| Object | 512    | 229.02 ns | 229.30 ns | 227.78 ns |
| Byte   | 768    | 83.24 ns  | 45.51 ns  | 45.85 ns  |
| Char   | 768    | 95.92 ns  | 85.20 ns  | 84.34 ns  |
| Object | 768    | 353.66 ns | 347.39 ns | 349.48 ns |
| Byte   | 1024   | 85.99 ns  | 58.31 ns  | 57.58 ns  |
| Char   | 1024   | 99.46 ns  | 100.62 ns | 112.01 ns |
| Object | 1024   | 457.07 ns | 455.94 ns | 457.47 ns |
| Byte   | 1536   | 92.40 ns  | 84.84 ns  | 84.44 ns  |
| Char   | 1536   | 111.75 ns | 112.97 ns | 168.02 ns |
| Object | 1536   | 653.64 ns | 649.47 ns | 643.37 ns |
| Byte   | 2048   | 100.61 ns | 101.04 ns | 111.81 ns |
| Char   | 2048   | 126.52 ns | 125.31 ns | 226.94 ns |
| Object | 2048   | 830.92 ns | 838.90 ns | 836.48 ns |
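
To read the "After" column, it helps to work out which rows cross the 2048-byte threshold. The helper below is a small sketch of that arithmetic; the names `ThresholdSketch` and `SkipsZeroing` are hypothetical, and the rule it encodes (element count compared against 2048 divided by the element size, reference types always zeroed) is an assumption inferred from the description and the numbers above, not a copy of the runtime code.

```csharp
using System;

public static class ThresholdSketch
{
    // Threshold in bytes, as proposed in this PR.
    private const int ThresholdBytes = 2048;

    // Returns true when, under the assumed rule, a request of `length` elements
    // would take the uninitialized (non-zeroing) path.
    public static bool SkipsZeroing(int length, int elementSize, bool containsReferences)
        => !containsReferences && length >= ThresholdBytes / elementSize;
}
```

Under this rule, `SkipsZeroing(1536, sizeof(byte), false)` is false while `SkipsZeroing(2048, sizeof(byte), false)` is true, and `SkipsZeroing(768, sizeof(char), false)` is false while `SkipsZeroing(1024, sizeof(char), false)` is true. That matches the table: Byte 2048 and Char 1024 and above track the "Before" column, smaller sizes track "new T[]", and Object rows never change.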

@adamsitnik
Member Author

adamsitnik commented Oct 22, 2019

The StringBuilder benchmarks (only those that show a difference; benchmarks whose results did not change are omitted):

public class Perf_StringBuilder
{
    const int LOHAllocatedStringSize = 100_000;

    private string _stringLOH = new string('a', LOHAllocatedStringSize);
    private string _string100 = new string('a', 100);

    [Benchmark]
    [Arguments(100)]
    [Arguments(LOHAllocatedStringSize)]
    public StringBuilder ctor_string(int length) => new StringBuilder(length == 100 ? _string100 : _stringLOH);

    [Benchmark]
    [Arguments(100)]
    [Arguments(LOHAllocatedStringSize)]
    public StringBuilder ctor_capacity(int length) => new StringBuilder(length);

    [Benchmark]
    [Arguments(100)]
    [Arguments(LOHAllocatedStringSize)]
    public StringBuilder Append_Char_Capacity(int length)
    {
        StringBuilder builder = new StringBuilder(length);

        for (int i = 0; i < length; i++)
        {
            builder.Append('a');
        }

        return builder;
    }

    [Benchmark]
    [Arguments(1)]
    [Arguments(1_000)]
    public StringBuilder Append_Strings(int repeat)
    {
        StringBuilder builder = new StringBuilder();

        // strings are not sorted by length to mimic real input
        for (int i = 0; i < repeat; i++)
        {
            builder.Append("12345");
            builder.Append("1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN");
            builder.Append("1234567890abcdefghijklmnopqrstuvwxy");
            builder.Append("1234567890");
            builder.Append("1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHI");
            builder.Append("1234567890abcde");
            builder.Append("1234567890abcdefghijklmnopqrstuvwxyzABCD");
            builder.Append("1234567890abcdefghijklmnopqrst");
            builder.Append("1234567890abcdefghij");
            builder.Append("1234567890abcdefghijklmno");
        }

        return builder;
    }
}

| Method               | Toolchain | repeat | length | Mean          | Ratio | Allocated |
|----------------------|-----------|--------|--------|---------------|-------|-----------|
| Append_Strings       | after     | 1000   | ?      | 119,591.06 ns | 0.97  | 559264 B  |
| Append_Strings       | before    | 1000   | ?      | 122,728.26 ns | 1.00  | 559264 B  |
| ctor_string          | after     | ?      | 100000 | 16,553.47 ns  | 0.72  | 200072 B  |
| ctor_string          | before    | ?      | 100000 | 23,078.37 ns  | 1.00  | 200072 B  |
| ctor_capacity        | after     | ?      | 100000 | 6,276.58 ns   | 0.48  | 200072 B  |
| ctor_capacity        | before    | ?      | 100000 | 13,019.77 ns  | 1.00  | 200072 B  |
| Append_Char_Capacity | after     | ?      | 100000 | 219,339.17 ns | 0.97  | 200072 B  |
| Append_Char_Capacity | before    | ?      | 100000 | 226,150.72 ns | 1.00  | 200072 B  |

@adamsitnik added the tenet-performance (Performance related issue) label on Oct 22, 2019
@adamsitnik
Member Author

BTW, the native call path could benefit a lot from https://github.com/dotnet/coreclr/issues/5329. I am going to add a comment there.

@stephentoub
Member

@adamsitnik, I don't understand the StringBuilder benchmarks: these all show them getting slower. What am I missing?

@adamsitnik
Member Author

adamsitnik commented Oct 22, 2019

> I don't understand the StringBuilder benchmarks: these all show them getting slower. What am I missing?

I am sorry: when I was hand-editing the results to replace the path to CoreRun with "before" and "after", I introduced a bug ;p

Edit: fixed. Thanks for catching the bug!

@VSadov
Member

VSadov commented Oct 22, 2019

Another interesting case would be an uninitialized allocation mixed with ordinary allocations. Allocating one object after each array should do it.
I’d expect the breakeven threshold to be higher for such a case.
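
A minimal sketch of the mixed pattern suggested here, for anyone who wants to measure it. The names `MixedAllocationSketch` and `AllocateMixed` are hypothetical, and the sketch assumes the public `GC.AllocateUninitializedArray<T>` overload from .NET 5 or later rather than the internal API under review; it shows only the allocation pattern, not a benchmark harness.

```csharp
using System;

public static class MixedAllocationSketch
{
    // Allocates an uninitialized array followed immediately by an ordinary
    // (zero-initialized) object, so both kinds of allocation are interleaved.
    public static (T[] Array, object Extra) AllocateMixed<T>(int length)
    {
        T[] array = GC.AllocateUninitializedArray<T>(length); // assumption: .NET 5+ public API
        object extra = new object();                          // the ordinary allocation
        return (array, extra);
    }
}
```

Returning both objects keeps them observable to the caller, which is what a benchmark method would need so that neither allocation can be treated as dead code.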

@jkotas
Member

jkotas commented Oct 22, 2019

@GrabYourPitchforks @bartonjs Thoughts about this from the security point of view?

@GrabYourPitchforks
Member

I can't think of anything offhand. As long as there's no way for a caller to observe the uninitialized data, you should be good. For instance, StringBuilder.set_Length should continue to zero-fill the padding to overwrite any previously existing data. StringBuilder.GetChunks should continue to return an enumerator whose ReadOnlyMemory<char> elements are properly sliced. (If a caller then uses MemoryMarshal or unsafe code to get at the rest of the data, they're taking responsibility for their own actions.)
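
The two invariants named above can be turned into simple checks. The following is a sketch, not one of the tests from this PR; the class and method names are hypothetical. It relies only on documented behavior: growing `StringBuilder.Length` pads with U+0000, and `GetChunks` (available since .NET Core 3.0) exposes the builder's content as `ReadOnlyMemory<char>` chunks.

```csharp
using System;
using System.Text;

public static class StringBuilderInvariantChecks
{
    // Shrinking and then growing Length must expose only '\0', never stale characters.
    public static bool PaddingIsZeroFilled()
    {
        var sb = new StringBuilder("secret");
        sb.Length = 0; // logically discard the content
        sb.Length = 6; // grow again over the same buffer
        for (int i = 0; i < sb.Length; i++)
        {
            if (sb[i] != '\0')
                return false;
        }
        return true;
    }

    // The chunks must cover exactly Length characters, not the whole capacity.
    public static bool ChunksCoverExactlyLength()
    {
        var sb = new StringBuilder(capacity: 4096);
        sb.Append("hello");
        int total = 0;
        foreach (ReadOnlyMemory<char> chunk in sb.GetChunks())
            total += chunk.Length;
        return total == sb.Length;
    }
}
```

Both methods are expected to return true regardless of whether the underlying buffer was zeroed at allocation time, which is the property the security review asks to preserve.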

@adamsitnik
Member Author

> Another interesting case would be an uninitialized allocation mixed with ordinary allocations. Allocating one object after each array should do it.
> I’d expect the breakeven threshold to be higher for such a case.

I've modified the benchmark to include a single object allocation before allocating the array, and the threshold grew to 7 KB.

@VSadov
Member

VSadov commented Oct 23, 2019

7k - is that with int parameter or uint?

@bartonjs
Member

In general I'm not a fan of "malloc vs calloc"; but if we have a plethora of tests then we can feel pretty confident that we're never acting on the uninitialized data (either StringBuilder internally or it being returned to callers).

@adamsitnik
Member Author

> 7k - is that with int parameter or uint?

int

@adamsitnik
Member Author

@jkotas using Unsafe.As over a cast helped:

Before (using cast): [two screenshots, images not preserved]

After (using Unsafe.As): [two screenshots, images not preserved]

Thanks for another great suggestion!
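
For readers unfamiliar with the difference, here is a small sketch of the two conversions being compared. The names `CastSketch`, `WithCast`, and `WithUnsafeAs` are hypothetical, and the assumption that the value being converted is an array handed back as a less specific type is mine; the comment does not show the actual call site.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class CastSketch
{
    // Regular cast: the runtime verifies the object's type and throws
    // InvalidCastException on a mismatch.
    public static T[] WithCast<T>(object array) => (T[])array;

    // Unsafe.As: reinterprets the reference with no type check at all.
    // Only correct when the caller already knows the exact runtime type.
    public static T[] WithUnsafeAs<T>(object array) => Unsafe.As<T[]>(array);
}
```

Both return the same reference for a correctly typed input. The trade-off is that `Unsafe.As` removes the check and its cost, so a wrong type argument leads to memory corruption rather than an exception.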

Member

@jkotas jkotas left a comment

LGTM once the CI is green.

… AllocateSzArray is responsible for handling negative size
@adamsitnik
Member Author

> once the CI is green

I've removed the following length check from AllocateNewArray:

PRECONDITION(length >= 0);

It is redundant because AllocateSzArray is already responsible for throwing the exception for a negative length:

if (cElements < 0)
    COMPlusThrow(kOverflowException);
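
From managed code, a negative array length surfaces as OverflowException. The sketch below shows this for the ordinary `new T[length]` path, which is documented behavior; the names `NegativeLengthSketch` and `ThrowsOverflow` are hypothetical, and the sketch makes no claim about which exception the internal API under review throws.

```csharp
using System;

public static class NegativeLengthSketch
{
    // Returns true if allocating an array of the given length throws OverflowException.
    public static bool ThrowsOverflow(int length)
    {
        try
        {
            byte[] array = new byte[length];
            GC.KeepAlive(array); // keep the allocation observable
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }
}
```

For example, `NegativeLengthSketch.ThrowsOverflow(-1)` returns true and `NegativeLengthSketch.ThrowsOverflow(16)` returns false.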
