[Proposal] Use B+-trees instead of AVL trees for Immutable Collections #14477
Comments
I'm not very familiar with B+-trees, but my first concern is whether they make it necessary to discard more of the tree than AVL trees do on mutations. If I'm reading it correctly, @sharwell addresses that concern and suggests it actually gets better (because the trees are shallower overall?). It sounds better for memory, and we know that n-ary trees in general have better lookup performance than AVL trees. So this sounds like a promising idea. |
@AArnott You are reading that correctly. For "standard" collections, the value for b is generally chosen for best interaction with the garbage collector and processor caches (e.g. relatively large pages for great locality, but intentionally small enough to keep them out of the large object heap for most expected types). I had excellent results with b set to 128 or larger for this type of data structure. In the case of Immutable Collections, I approached the analysis from the other side of minimizing the cost of mutations. |
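For concreteness, here is roughly what "declaring data values in the data pages directly" could look like for a small fixed b (a sketch under my own assumptions, not the proposed layout; all names are hypothetical):

```csharp
// Hypothetical leaf page for b = 8: values live in fields rather than in a
// T[], so there is no inner array object (header + length) to pay for.
internal sealed class LeafPage<T>
{
    internal byte Count;           // slots in use: 0..8
    internal T Value0, Value1, Value2, Value3,
               Value4, Value5, Value6, Value7;
}
```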
Great. Another concern is construction performance. When we had a 4-ary tree, construction was much slower. |
Seems like a nice improvement to me. We've avoided ImmutableList in Roslyn because of the measured allocation overhead. (But we lean heavily on ImmutableArray to keep us happy.) |
+1 👍 |
One other thing to keep tabs on is what it means for Builders. |
The Rust language uses B-trees for its maps and sets :D |
This is likely to make the mutators' execution times worse (Add, Insert, etc.), because of the increased footprint of the nodes to clone/alter. Would that bother you less than the overall memory consumption does? |
@evilguest The proposal is explicitly based on the idea that this is not a problem. While each index node in the B+-tree is larger than a single node in the AVL tree, many fewer nodes need to be modified during a mutation operation. For the sample collection size of 100,000 elements, the single-mutation memory allocation reduction is 39% on X64 and 21% on X86. There may be a few collection sizes that do not benefit from this, but I'm not sure what those are right now. |
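A back-of-the-envelope depth comparison (my own arithmetic, assuming b = 8 and a well-balanced AVL tree) shows why fewer allocations are expected even though each B+ page is larger:

```latex
% Nodes cloned per point mutation on an n = 100{,}000 element list:
\text{AVL: } \lceil \log_2 n \rceil = 17 \text{ small nodes}
\qquad
\text{B+ tree, } b = 8\text{: } \lceil \log_8 n \rceil = 6 \text{ larger pages}
```

Six larger copies still total less than seventeen small ones, which is consistent with the 775-byte to 476-byte reduction cited above.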
Thanks. I am experimenting in the same direction as https://github.com/tunnelvisionlabs/dotnet-trees, and the stock implementation of System.Collections.Immutable.ImmutableList is surprisingly hard to beat from a performance standpoint. E.g. ImmutableList.this[i] in tunnelvision's version is 2-10 times slower than the MS one (depending on the list size). |
@evilguest Did you get to benchmarking your modifications? |
@arkalyanms the biggest help would be a bunch of benchmark scenarios to establish baselines. I can give you an overview of the in-flight work tomorrow. 😃 |
My benchmarking suite can be found at https://github.com/evilguest/atropos/tree/main/Atropos.Benchmarks |
I am not yet happy with the List implementation because of both memory overhead and underperformance for the modification operations. |
I'm potentially interested in working on this. |
@madelson probably the best first step is to ensure that the scenarios we care about are fully covered in dotnet/performance, before making any changes here. Certainly for elapsed time - @kunalspathak does the lab detect allocation regressions as well? |
No, we do not have those that we could migrate. But @sharwell generally handles performance for Roslyn overall, so I expect he'd be testing our key scenarios as a part of evaluating this change. |
Note that Roslyn is becoming a bit desensitized to this issue. Since many of our performance-sensitive areas construct a collection which is subsequently read but not mutated, we've been migrating to segmented collections (essentially a 2-level B+ tree with variable-sized index level and constant-count leaves). CPS and MSBuild are two other candidates for this performance testing though. |
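A minimal sketch of the segmented shape described above (hypothetical names, not Roslyn's actual types): a variable-sized index level over constant-count leaves, so indexing is one shift, one mask, and two array dereferences.

```csharp
public sealed class SegmentedList<T>
{
    private const int PageShift = 10;                // 1024-element pages
    private const int PageMask = (1 << PageShift) - 1;

    private readonly T[][] _pages;                   // variable-sized index level
    public int Count { get; }

    public SegmentedList(T[][] pages, int count)
    {
        _pages = pages;
        Count = count;
    }

    // No per-element tree nodes at all; lookups never chase more than
    // two pointers.
    public T this[int index] => _pages[index >> PageShift][index & PageMask];
}
```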
I'm going to leave this comment here to remind myself of a future API proposal if this issue makes it in.

Unsolicited Suggestion 1: The current proposal is for a fixed b, but there is value in a configurable b. Immutable Collections are used in a broad range of applications, and allowing the user to choose the arity gives developers more flexibility. Need faster read performance? Great, increase the arity to make scans and seeks faster.

Unsolicited Suggestion 2: As a side note, I would also welcome a scan API that takes advantage of the internal structure and allows block-based iteration. This shape might need some work to allow the action to stop the scan early.

```csharp
public static class CollectionsMarshal
{
    public static void Scan<TArg, T>(ImmutableList<T> list, TArg arg, ReadOnlySpanAction<T, TArg> action);
    public static void ScanRange<TArg, T>(ImmutableList<T> list, TArg arg, Range range, ReadOnlySpanAction<T, TArg> action);
}
```
|
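If that proposed Scan existed, a caller might sum a list block by block without per-element enumerator overhead. A hedged usage sketch (CollectionsMarshal.Scan here is the hypothetical API above, not a shipping method):

```csharp
using System;
using System.Collections.Immutable;
using System.Linq;
using System.Runtime.CompilerServices; // StrongBox<T>

var list = ImmutableList.CreateRange(Enumerable.Range(0, 1_000_000));
var total = new StrongBox<long>(0);

// The TArg channel carries the accumulator so the callback can stay
// 'static' (no closure allocation).
CollectionsMarshal.Scan(list, total, static (ReadOnlySpan<int> block, StrongBox<long> box) =>
{
    foreach (int value in block)
        box.Value += value;
});

Console.WriteLine(total.Value); // 499999500000
```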
To try it out, I made a prototype of a B+ tree
Benchmark info:
|
Those numbers look great. Indexing and adding both look good. I'm interested in moving forward with this. |
@AArnott if this is something I can contribute, how would you recommend going about the development? For example, would it make sense for me to do a writeup of the proposed design to get feedback/alignment on that as a first step? I'm also curious what the benchmarking/validation process should be. I did add some additional benchmarks to dotnet/performance (one set got merged, the other is still in PR), but I could imagine that for a change like this it could be desirable to have additional more detailed benchmarks for a side-by-side comparison (as I've posted above). There was also discussion of trying out the new approach in MSBuild. In addition, some way to run the benchmark suite on different operating systems seems important. |
Good questions, @madelson. While I wrote the current implementation of the immutable collections, I don't typically work in this repo so I think we should wait for some other folks who accept PRs to this repo to provide the guidance you're looking for. I don't see the Builder pattern included in your prototype. I'll be interested in seeing how you can incorporate that. The memory/perf characteristics of our current builder implementation are that creating a builder based on an immutable collection is O(1) in time and memory. Switching back from builder to immutable collection is O(n) in time and O(1) in memory, where n is the number of nodes actually changed while in Builder mode. Can we achieve that with your B+ tree? |
Makes sense.
I've thought about this a bit but haven't built anything. Obviously the fact that I'm using arrays as nodes prevents us from following the current pattern of having a "frozen" bool on each node.

The first step is representing mutability. For internal nodes, I believe we can use the highest-order bit of the count field. Using this marking system, we can create a new builder in O(1) time/memory, and we can convert back to an immutable structure in-place by traversing the tree and freezing all mutable nodes we encounter (short-circuiting on already-frozen nodes), just like we do today.

One design question with this approach is how much we should try to unify the builder and immutable methods. For example, the builder methods would always need to mask out the first bit when reading the count.

Another design question is whether the builder should attempt to use partially-full arrays to eliminate allocations. For example, in immutable world if we have a leaf of size 4 we represent it with just a 4-element array. Same thing goes for internal nodes. In builder world I see 2 options: (1) keep every array exactly full, allocating a right-sized copy on each change, or (2) allow partially-full arrays so that some changes can happen in place.

With approach (1), we end up allocating a new leaf node for every builder add/insert, and much less frequently we'll need to allocate new internal nodes as well. In contrast, approach (2) can sometimes add/insert without creating any new nodes. On the other hand, with approach (1) converting back to the immutable structure is trivial, whereas for approach (2) we either have to "compact" many nodes or we have to make the immutable structure able to handle partially-full nodes and accept the loss of space efficiency. Approach (1) also leads to simpler code because we can use the array's length as the node count, whereas with approach (2) we have to flow the true count down from the parent node.

All things considered, I would probably lean towards approach (1) because it seems simpler and because the existing algorithm's builder has to create a new leaf node for every add/insert anyway, so we're not losing a ton there. But I'm not sure; I'm kind of assuming that builders are a niche use-case. |
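A minimal sketch of that marking scheme, assuming the flag rides in the sign bit of the cumulative count stored next to each child pointer (my guess at the details; all names are hypothetical):

```csharp
internal struct IndexEntry
{
    internal const int MutableMask = unchecked((int)0x80000000);

    internal int CountAndFlag;  // high bit set => the child below is still mutable
    internal object Child;      // either IndexEntry[] (internal node) or a leaf array

    internal readonly int Count => CountAndFlag & ~MutableMask;
    internal readonly bool IsMutable => (CountAndFlag & MutableMask) != 0;
}

internal static class BuilderOps
{
    // ToImmutable: clear the mutable bits in place, short-circuiting on
    // already-frozen subtrees, so the cost is O(nodes changed since ToBuilder).
    internal static void Freeze(IndexEntry[] node)
    {
        for (int i = 0; i < node.Length; i++)
        {
            if (!node[i].IsMutable)
                continue;                      // frozen subtree: nothing below changed
            node[i].CountAndFlag &= ~IndexEntry.MutableMask;
            if (node[i].Child is IndexEntry[] child)
                Freeze(child);                 // leaf arrays carry no flag of their own
        }
    }
}
```

The flag lives on the parent's entry, so leaves need no self-marking at all.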
@madelson I'd be interested to see those benchmarks run against the implementation in tunnelvisionlabs/dotnet-trees. I have a builder implementation there which adheres to the complexity bounds mentioned by @AArnott. |
@sharwell here's what I got when I added TunnelVision's
|
I added a basic builder implementation to the prototype. For now the builder just supports indexer get/set and ToBuilder/ToImmutable. However, this should illustrate how the builder can function in general and it meets the performance criteria above. Benchmarks:
The benchmark I'm using does the following:
|
A few more benchmarks to share:
Some notes:
|
Added a stack-based enumerator like what
|
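For reviewers, this is roughly the shape a stack-based enumerator over array nodes could take (a sketch assuming internal nodes are object[] and leaves are non-empty value arrays; not the prototype's actual code):

```csharp
using System;
using System.Collections.Generic;

public struct TreeEnumerator<T>
{
    private readonly Stack<(object[] Node, int NextChild)> _stack; // one frame per level
    private T[] _leaf;
    private int _leafIndex;

    public T Current { get; private set; }

    public TreeEnumerator(object root) : this()
    {
        _stack = new Stack<(object[], int)>();
        DescendToLeftmostLeaf(root);
    }

    private void DescendToLeftmostLeaf(object node)
    {
        // Record our position at each internal level; depth is only log_b(n).
        while (node is object[] index)
        {
            _stack.Push((index, 1));  // child 0 is the one we descend into now
            node = index[0];
        }
        _leaf = (T[])node;
        _leafIndex = 0;
    }

    public bool MoveNext()
    {
        while (true)
        {
            if (_leafIndex < _leaf.Length)
            {
                Current = _leaf[_leafIndex++];
                return true;
            }

            // Leaf exhausted: pop to the nearest ancestor with an unvisited child.
            if (_stack.Count == 0)
                return false;

            var (node, next) = _stack.Pop();
            if (next < node.Length)
            {
                _stack.Push((node, next + 1));
                DescendToLeftmostLeaf(node[next]);
            }
        }
    }
}
```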
I haven't studied the perf numbers, but wanted to respond to one of the questions about memory allocations: people use the builders to reduce allocations and improve perf of creating a structure that will be frozen later. Allocating and copying an array on every Add would defeat that. For that reason I support having 'slack' in the array at least during the builder's use. |
@AArnott can you clarify what you mean by "cutting the slack" here? Are you coming down in favor of fast, alloc-free ToImmutable (no slack in the arrays) or in favor of more expensive ToImmutable() in exchange for fewer allocations on the individual builder operations? The topic of how the builder should work is one I've given a bit of thought to and I think there are some interesting considerations / options / tradeoffs worth mentioning. First off, consider the approach where there is no slack (arrays are always full). While this might seem like it defeats the point of having a builder, it still offers real benefits:
Next, consider the approach where we do have slack to avoid recreating leaf arrays on every change:
Finally, I'll note a couple of "in-between" options that might be worth thinking about:
Sorry for the big wall of text, but curious to hear your thoughts, especially what specific factors / benchmarks should influence the decision. |
Fewer allocations in the Builder demand slack. ToImmutable's lean result demands no slack. So the direction I'm leaning is to create arrays with slack in the Builder, and then trim the slack away during ToImmutable. It wins here and loses there, but as Builders are used to initialize large collections, I think these are the right trade-offs. |
I just finished reading your last comment. Great points. I guess, given that the arrays at each node wouldn't exceed some (small) length (right?), allocating another array and copying the elements over might not be too bad. But it would be interesting to measure filling 15000 items into a collection both ways to see how perf is impacted (and how much GC pressure each produces). Maybe you've already collected that data above... I haven't studied the tables.
If there is a significant perf win, and we have adequate tests for all the code (if the code isn't all shared), that feels ok to me.
Such optimizations sound great. And then there's AddRange(IEnumerable) which presumably should allow you to perform optimally (particularly if a Count property is available via conditional casting). |
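To make that concrete, an AddRange fast path might probe for a count up front (a sketch; ReserveTrailingSlack is a hypothetical helper, not part of the prototype):

```csharp
public void AddRange(IEnumerable<T> items)
{
    // Conditional cast: many sources (arrays, List<T>, ImmutableList<T>)
    // expose a cheap Count, letting the builder size the right spine once
    // up front instead of growing leaf-by-leaf.
    if (items is ICollection<T> collection)
        ReserveTrailingSlack(collection.Count);   // hypothetical helper

    foreach (T item in items)
        Add(item);
}
```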
Note that the implementation I worked on has different slack characteristics. While the implementation uses fixed-size inline arrays, it separately guarantees minimum slack exists in cases where a list is constructed from beginning to end (e.g. starting from empty and calling |
By this do you mean that the resulting B+ tree is "compact" in the sense that a naive implementation of If so, that's also what I'm doing in the array-based implementation although I haven't built this out for the builder yet.
@AArnott are you imagining randomized insertion or |
@AArnott ok I had some time to code up Builder.Add() with allowance for slack in the leaves to reduce allocations. Here are the numbers:
Interestingly, without the slack optimization the timing numbers are comparable, but the allocation numbers are worse (ratio ~2.0 instead of ~0.25 like we see here). Therefore, I think having slack in the leaves is a good fit for the builder. There's a good amount of room to tune as well. Right now we always expand the node to maximum size, but we could instead do something more conservative, like going to max / 2 on the first expansion, which might be better for insertions. We could also switch from a no-slack mode to a slack mode based on the number of expansions performed, like you mentioned. I'm not sure either of those is worth exploring now, but it could make sense in the future. |
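The two growth policies differ only in the capacity chosen when a full leaf must be reallocated. A sketch (constants hypothetical, matching b = 8 from the proposal):

```csharp
// Called when 'leaf' is full and one more element must fit.
private static T[] GrowLeaf<T>(T[] leaf)
{
    const int MaxLeafSize = 8;

    // Current policy: jump straight to maximum size. The conservative
    // alternative stops at MaxLeafSize / 2 on the first expansion, which
    // wastes less space when insertions are scattered across many leaves.
    int newCapacity = leaf.Length < MaxLeafSize / 2 ? MaxLeafSize / 2 : MaxLeafSize;

    var grown = new T[newCapacity];
    Array.Copy(leaf, grown, leaf.Length);
    return grown;
}
```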
Those numbers look good. I prefer the "prototype" columns to the "TunnelVision" column, as a few percentage points more memory seems totally worth being several times faster. |
@AArnott note that the time ratio for my branch is primarily caused by the fact that the inline 8 element array search has not been vectorized and uses a coding pattern in the indexer which the JIT doesn't know how to optimize. It is designed to support branchless lookups on machines with 256-bit registers at each level of index once that work is complete. |
@sharwell are you referring to something that would optimize these switch statements? If so, could
Not sure if this is making some invalid assumptions about layout. |
That's part of the issue, but it can go much further. For example, an 8-element binary search could be implemented in a completely vectorized form based on code similar to https://fioraaeterna.tumblr.com/post/26644202378/fast-binary-searching.
This was implemented in tunnelvisionlabs/dotnet-trees#102, but there are concerns that this may violate layout assumptions made by the JIT. Switching to |
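To illustrate the kind of branchless lookup described there, here is one way an 8-key rank could be computed with AVX2 (an illustration only, not the dotnet-trees code; real code must also handle partially filled nodes and non-int keys):

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Returns how many of the 8 contiguous keys starting at 'key0' are less
// than 'searchKey' -- i.e. the child slot to descend into -- with no branches.
static int Rank(ref int key0, int searchKey)
{
    Vector256<int> keys = Unsafe.ReadUnaligned<Vector256<int>>(
        ref Unsafe.As<int, byte>(ref key0));
    Vector256<int> needle = Vector256.Create(searchKey);

    // Lane is all-ones where needle > key, i.e. where key < searchKey.
    Vector256<int> less = Avx2.CompareGreaterThan(needle, keys);
    int mask = Avx.MoveMask(less.AsSingle());   // one bit per 32-bit lane
    return BitOperations.PopCount((uint)mask);
}
```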
I want to poke this issue to see how I can help move this towards the finish line. There is some great work that has been done so far and I think folks will benefit from it if we can get it there. I'm assuming more comprehensive testing needs to be done, but are there still outstanding design questions? |
Without re-reading the whole thread, IIRC the issues are:
The time required can be hard to come by, but your pinging here helps us understand community demand for it so we can prioritize accordingly. |
Currently the Node class used for ImmutableList<T> includes 40 bytes of overhead per list element on X64, and 24 bytes of overhead on X86. This means the memory cost of using this data structure for an ImmutableList<int> is 1000% on X64, and 600% on X86 (these figures represent the amount of memory used for storing non-data elements associated with the structure).

I propose migrating to a B+-tree instead, using a fixed value for b and declaring data values in the index and data pages directly (to avoid unnecessary memory overhead for arrays). The choice of b primarily affects the following things:
In addition to the above, B+-trees provide a substantial memory savings by removing the child pointers (the Left and Right pointers for AVL nodes) from the leaves.
My initial calculations (which could be completely wrong) indicate that the average memory allocation required for a single mutation operation on a list of 100000 elements using the AVL structure is 775 bytes for X64 and 493 bytes for X86. The optimal value of b for X64 in terms of mutation allocation overhead is b=8, which costs 476 bytes per operation on a list of the same size (39% reduction). For X86 this same value of b costs 388 bytes per mutation (21% reduction, and b=6 would be optimal with 382 bytes). The overall storage overhead for b=8 drops a staggering amount down to 169% for X64 and 119% for X86 (counting the average page at 75% capacity, based on a merge at 50% and split at 100%).
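For readers who want to see where the 40/24-byte figures could come from, one plausible accounting of the current AVL node layout (my reading of the field layout, not an official breakdown):

```
X64: 16 (object header + method table) + 8 (_left) + 8 (_right)
     + 4 (_count) + 1 (_frozen) + 1 (_height) + 2 (padding)   = 40 bytes
X86:  8 (header) + 4 (_left) + 4 (_right) + 4 (_count)
     + 4 (_frozen/_height + padding)                          = 24 bytes
```

For ImmutableList<int>, the 4-byte payload rides alongside that overhead, giving 40 / 4 = 1000% on X64 and 24 / 4 = 600% on X86.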