Remove TransitiveClosure param from Space::trace_object and Scanning::scan_object #559
Comments
I feel we should think more carefully about this.

I thought I agree that

For your proposal:
|
I looked at various implementations. While most use cases use it as a real transitive closure in the sense of graph theory, other use cases are, in my opinion, abuses. They actually need callback functions that consume either an object or an edge. What is
|
@wks I believe it should be compatible with mark-compact. |
Some history of the
From step 2, we see
And they should be separated. |
I tried to change the return value of
|
Sorry, I was wondering what machine you used and how many invocations you ran? |
20 invocations on hare.moma |
I developed two solutions for removing `TransitiveClosure`.
I prepared three builds for testing. All used OpenJDK, and the
I ran the test three times, with slightly different configurations.

The first time

In the beginning, I had only developed one of the solutions (build2). I used 6x heap size, 20 iterations, 5 invocations each. Comparing build1 and build2:
Plot: Both box plot and violin plot are drawn. The yellow line in the box is the median, and the yellow line in the violin is the mean. The data points of all 20 iterations are also drawn.

From the result of the first time, I thought build2 had a significant performance impact, especially with Immix. Then I coded a more conservative approach (still passing the queue, not using the return value for queueing) in the

The second time

Running the same tests (with the same built binaries of build1 and build2) with 8x heap size (some tests failed with 6x heap size) and 40 invocations, the results are a bit different. Comparing build1 and build2:
Plot: Both box plot and violin plot are drawn. The yellow line in the box is the median, and the yellow line in the violin is the mean. The data points of all 40 iterations are also drawn.

Strangely, this time build2 was faster than build1 with Immix and GenImmix, although the binaries are the same.

The third time

Then I ran the test with all of build1, build2 and build3 (build1 and build2 are still the same binaries). This time using 6x heap size (and no runs failed) and 40 iterations. Comparing build1 and build2:
Comparing build1 and build3:
Plot: Both box plot and violin plot are drawn. The yellow line in the box is the median, and the yellow line in the violin is the mean. The data points of all 40 iterations are also drawn.

This time, build3 is the slowest, although I thought build3 should be very similar to build1 while build2 may have some visible overhead.

Conclusion

It seems that the STW time varies too much, and even 40 invocations are too few to show which build is the clear winner. Either the impact is insignificant, or I need a better way to do the experiment. |
Might be worth using a different benchmark such as

EDIT: Going by the LXR paper, |
Solution 2 looks straightforward. It just moves the enqueue from

Solution 1 adds more clarity to the code. The policy's |
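For readers following the thread: below is a minimal sketch of what the "still passing the queue" flavour (the object-queue-trait build) could look like. The trait, type and method names are illustrative stand-ins, not necessarily what the actual branch uses.

```rust
// Stand-in object reference type so the sketch compiles on its own.
#[derive(Clone, Copy)]
pub struct ObjectReference;

// Illustrative trait: a space enqueues newly marked objects through this,
// instead of through TransitiveClosure::process_node.
pub trait ObjectQueue {
    fn enqueue(&mut self, object: ObjectReference);
}

pub struct SomeSpace;

impl SomeSpace {
    // trace_object still receives a queue, but it is a plain queue of
    // objects rather than a full TransitiveClosure.
    pub fn trace_object<Q: ObjectQueue>(
        &self,
        queue: &mut Q,
        object: ObjectReference,
    ) -> ObjectReference {
        // ...if the object has not been marked before:
        queue.enqueue(object);
        object
    }
}
```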
@k-sareen I re-ran the last test on bobcat.moma. It is exactly the same source code and

Comparing build1 and build2:
Comparing build1 and build3:
It looks like there is no clear winner, as the results of all builds are quite noisy. But the plot for SemiSpace concerns me, because like on hare.moma, build2 and build3 seem to be systematically slower than build1. |
Bobcat is an Alder Lake CPU which can give funky results due to its Efficiency cores. Did you make sure that the benchmarks were run only on the Performance cores (something like

Though yes, it does seem quite weird that SS's performance is slightly worse. Not quite sure why it would be slower for build3. |
@k-sareen No. I didn't do anything special about big/small cores. But
I guess it may have something to do with function inlining. I'll check. |
Repeated the last experiment for SemiSpace on bobcat.moma. This time I have 9 builds:
That is, I duplicated the

I expect the distributions of build1x, build1y and build1z to be identical, because they are the same binary. The same should be true for build2{x,y,z} and build3{x,y,z}. Conversely, if build2 is really worse than build1, I expect all of build2{x,y,z} to be consistently worse than build1{x,y,z}. The same applies to build3.

But from the plot, that is not the case. The distribution can vary greatly even for the same actual build (i.e. the same binary). I conclude that there is no significant difference between build1, build2 and build3. If there is any, it is less significant than the noise.

FYI, in the following plot, each column contains all the combined 120 data points for each actual build (build1 = build1x + build1y + build1z, ...). |
I ran the last benchmark after the work stealing PR was merged. 40 iterations, 5 invocations each, 6x min heap size. build1=master, build2=trace-object-return2, build3=object-queue-trait

("build4" is identical to "build1" except its directory name is extremely long. I thought the directory name length might affect the live objects in the heap that hold directory names (such as jar files), causing SemiSpace to copy more data than the others. But from the data, that does not seem to be the case.)

Running SemiSpace alone for 120 iterations (build1{x,y,z} are the same, ...):

Combined:

It looks like the work stealing PR made tracing more efficient, and amplified the impact of build3. It seems to affect the GC algorithms that do more copying. I'll look into that. |
I wrote a micro benchmark that creates a 22-level binary tree, and then runs

I ran it on build1 and build3, interleaving, 5 invocations each. That resulted in 500 data points (GC time for each

I ran that on my local machine, a Coffee Lake, without any special tuning such as disabling frequency scaling. The result is quite obvious. It is not noise. |
Found the problem.
The evidence is the result of

Output of
The output of the
I changed the annotation to
The plot after changing to

I'll further investigate other spaces to make sure their |
Fortunately, I also tried adding

So I just added |
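For anyone unfamiliar with the Rust attributes being discussed: the exact annotations involved are elided in the comments above, so the following is only a hypothetical illustration of the two inlining hints, not the actual mmtk-core change.

```rust
// Hypothetical illustration of inlining hints on a hot per-object helper.
// #[inline] permits cross-crate inlining; #[inline(always)] strongly requests
// inlining at every call site, keeping the per-object fast path flat.
#[inline(always)]
fn mark(word: u32) -> u32 {
    word | 0x1 // stand-in for a cheap mark-bit operation
}

fn main() {
    let marked: u32 = (0..1_000u32).map(mark).fold(0, |a, b| a ^ b);
    println!("{marked}");
}
```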
This is the result of lusearch on bobcat.moma after fixing the inlining issue in build3 (other builds are not changed). 5 invocations each, 20 iterations, 6x heap size. Comparing build1 and build3 (I decided to discard build2 in favour of build3):
Plot: The difference between the mean times of build1 and build3 is very small. The distributions of their data points are similar, too. |
TL;DR: The type parameter of `XxxxxSpace::trace_object<T: TransitiveClosure>` and the `trace: &mut T` parameter can be removed, and the information should be carried by the return value.

TL;DR2: The `trace: &mut T` parameter of `Scanning::scan_object` should also be removed and replaced by a lambda. (See comment below; a rough sketch follows the TODO list.)

TODO list:

- Remove the `<T: TransitiveClosure>` parameter from `XxxxxSpace::trace_object`
- Remove the `<T: TransitiveClosure>` parameter from `Scanning::scan_object`
- Remove the `TransitiveClosure` trait completely
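Regarding TL;DR2, here is a minimal sketch of what "replaced by a lambda" could mean. It is only an illustration of the idea; apart from the name `scan_object`, the types and parameter names are stand-ins rather than the actual mmtk-core API.

```rust
// Stand-in types so the sketch compiles on its own; in mmtk-core these would
// be the real object reference and slot address types.
#[derive(Clone, Copy)]
pub struct ObjectReference;
#[derive(Clone, Copy)]
pub struct Address;

// Instead of `trace: &mut T` with `T: TransitiveClosure`, scan_object takes a
// closure that is invoked once per reference field (edge) of the object.
pub fn scan_object<F: FnMut(Address)>(object: ObjectReference, mut visit_edge: F) {
    let _ = object;
    // A VM-specific implementation would call `visit_edge(slot)` for every
    // reference field it finds in `object`; shown here with a dummy slot.
    visit_edge(Address);
}
```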
What?

Each space (like this one) has a `trace_object<T: TransitiveClosure>` method that takes a `trace: &mut T` parameter.

We mentioned the removal of the `<T: TransitiveClosure>` type parameter in #110. In that issue, @qinsoon mentioned that once we remove plan-specific ProcessEdgesWork, we can remove it.

But I think we can remove it earlier than that, and plan-specific edge-processing is not even the core of the problem. The `trace` parameter is more like an "out parameter" in some languages.

What does `trace: &mut T` do?

It only serves one purpose for any Space: to enqueue the newly marked object.

`Space::trace_object` does two things:

1. It marks the object, and enqueues it (for later scanning) if it has not been visited before.
2. It returns the object's reference, which may be a new address if the object was moved.

So the result of the function should also consist of two parts:

1. Whether the object was newly marked (and therefore needs to be enqueued).
2. The object's (possibly forwarded) reference.
The current `trace_object` signature is:
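For context, a sketch of the rough shape of that method. The stand-in types exist only so the snippet compiles on its own, and some spaces take extra parameters (such as an allocation semantics and a copy context), so this is not the exact signature of any particular space.

```rust
// Stand-in types so this sketch compiles on its own.
#[derive(Clone, Copy)]
pub struct ObjectReference;
pub trait TransitiveClosure {
    fn process_node(&mut self, object: ObjectReference);
}

pub struct SomeSpace;

impl SomeSpace {
    // Rough shape of the current per-space method.
    pub fn trace_object<T: TransitiveClosure>(
        &self,
        trace: &mut T,
        object: ObjectReference,
    ) -> ObjectReference {
        // ...mark (and possibly forward) the object...
        trace.process_node(object); // part (1): enqueue the newly marked object
        object // part (2): return the (possibly forwarded) reference
    }
}
```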
With this signature, the return value `ObjectReference` only carries the information of part (2). Part (1) is done by calling `trace.process_node(object)` just before returning. See:

- `MallocSpace`: here, just before returning.
- `CopySpace`: here, just before returning.
- `ImmixSpace`: without copying and with copying, just before returning.
- `MarkCompactSpace`: marking and forwarding, almost just before returning.
- `LargeObjectSpace`: here, just before returning.

Currently, the `T: TransitiveClosure` type parameter can only be instantiated with `ProcessEdgesWork` implementations, and `ProcessEdgesWork::process_node(object)` has only one implementation. It simply adds `object` into `ProcessEdgesBase::nodes`. This makes the polymorphism against `T: TransitiveClosure` pointless, for it is not polymorphic at all.

What should it return?
`trace_object` should have returned this:
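My reading of the proposed return type, inferred from the examples below. The name `TraceObjectResult` comes from the text; the exact field layout is an assumption.

```rust
pub struct ObjectReference; // stand-in for mmtk's object reference type

// .0 = whether the object was newly marked (and therefore must be enqueued);
// .1 = the object's new address if it was moved, or None otherwise.
pub struct TraceObjectResult(pub bool, pub Option<ObjectReference>);
```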
`MallocSpace::trace_object` should return `TraceObjectResult(true, None)` if called on a "grey" object, and `TraceObjectResult(false, None)` if the object is "black".

If an object is in a from-space, `CopySpace::trace_object` should return `TraceObjectResult(true, Some(newobj))` the first time it visits an object, but `TraceObjectResult(false, Some(newobj))` for subsequent visits. For objects in to-spaces, the second part should be `None`.

The Immix space, depending on whether it is defragging, may either return `TraceObjectResult(xxx, None)` or `TraceObjectResult(xxx, Some(newobj))`.

How should it be used?
`trace_object` is called in `ProcessEdgesWork::process_edge`. The default implementation is:
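Roughly, the default implementation has this shape. This is a sketch from memory rather than verbatim mmtk-core code, so the constant and helper names may differ; the stand-in types are only there to make the snippet self-contained.

```rust
// Stand-in types so the sketch compiles on its own.
#[derive(Clone, Copy)]
pub struct ObjectReference;
#[derive(Clone, Copy)]
pub struct Address;
impl Address {
    pub unsafe fn load(self) -> ObjectReference { ObjectReference }
    pub unsafe fn store(self, _new: ObjectReference) {}
}

pub trait ProcessEdgesWorkSketch {
    const OVERWRITE_REFERENCE: bool = true;
    fn trace_object(&mut self, object: ObjectReference) -> ObjectReference;

    // Default process_edge: load the reference from the slot, trace it, and
    // write back the (possibly forwarded) reference.
    fn process_edge(&mut self, slot: Address) {
        let object = unsafe { slot.load() };
        let new_object = self.trace_object(object);
        if Self::OVERWRITE_REFERENCE {
            unsafe { slot.store(new_object) };
        }
    }
}
```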
`ImmixProcessEdges` and `GenNurseryProcessEdges` override it, but the only difference is the way to decide whether to store back to the slot. With this change, it only needs to be implemented in one way:
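As I read the proposal, the single implementation would look something like this, reusing the `Copy` stand-in types and the `TraceObjectResult` sketch from the snippets above. Again, this is an illustration, not the actual patch.

```rust
// Single process_edge for all plans: the return value says both whether to
// enqueue the object and whether (and where) it moved.
pub trait NewProcessEdgesWorkSketch {
    fn trace_object(&mut self, object: ObjectReference) -> TraceObjectResult;
    fn enqueue(&mut self, object: ObjectReference);

    fn process_edge(&mut self, slot: Address) {
        let object = unsafe { slot.load() };
        let TraceObjectResult(newly_marked, forwarded) = self.trace_object(object);
        if newly_marked {
            // Enqueueing moves out of the spaces into this one place.
            self.enqueue(forwarded.unwrap_or(object));
        }
        if let Some(new_object) = forwarded {
            // Store back only when the object actually moved.
            unsafe { slot.store(new_object) };
        }
    }
}
```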
Discussion

- Is it compatible with mark-compact? @tianleq
- Is there any performance issue?