-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16673: [Java] Integrate C Data into allocator hierarchy #14506
Conversation
lidavidm
commented
Oct 25, 2022
•
edited
Loading
edited
- Document some of the memory management internals
- Note that in practice, you have to use the concrete implementations of certain interfaces
- Begin integrating C Data into the full memory management stack instead of trying to patch it in at the edges
java/memory/memory-core/src/main/java/org/apache/arrow/memory/BufferLedger.java
Outdated
Show resolved
Hide resolved
TODOs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me so far.
java/memory/memory-core/src/main/java/org/apache/arrow/memory/ForeignAllocation.java
Outdated
Show resolved
Hide resolved
@@ -18,8 +18,10 @@ | |||
package org.apache.arrow.memory; | |||
|
|||
/** | |||
* Reference Manager manages one or more ArrowBufs that share the | |||
* reference count for the underlying memory chunk. | |||
* ReferenceManager is the reference count for one or more allocations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure "ReferenceManager is the reference count..." is completely accurate. Maybe "ReferenceManager is the reference count manager..." or "ReferenceManager tracks the reference count..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between these? Given the public interface is a set of reference counting methods, I'd say it is a reference count. (And for instance, the way you manipulate the reference count of an ArrowBuf is by getting its reference manager.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a nit for sure. Ignore it
@@ -91,7 +95,7 @@ public interface ReferenceManager { | |||
/** | |||
* Transfer the memory accounting ownership of this ArrowBuf to another allocator. | |||
* This will generate a new ArrowBuf that carries an association with the underlying memory | |||
* for the given ArrowBuf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method comment is misleading. While it can transfer the memory accounting ownership ... to another allocator, it can also create a new ArrowBuf in the same allocator if the allocator argument is the same as the existing allocator. This can be useful for creating a new Vector over the same underlying memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try to reword this a bit, hope that looks better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
java/memory/memory-core/src/main/java/org/apache/arrow/memory/AllocationManager.java
Show resolved
Hide resolved
497bb3c
to
edaf4ee
Compare
Still haven't updated docs, but all types are implemented now. There's two failure modes that need tests, then fixes. |
import org.apache.arrow.memory.BufferAllocator; | ||
import org.apache.arrow.memory.ForeignAllocationManager; | ||
|
||
/** Manage the reference count of an imported C Data ArrowArray. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but I'm a bit concerned that the class name and the fact that it does not implement ReferenceManager obscures the fact that it does reference counting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a fundamental mismatch in memory handling designs and this cannot implement ReferenceManager. I will update the docstring and reconsider the name but I do not think anything else is possible, not without entirely overhauling the core implementation. Even if in theory we can work with only the interfaces provided, in practice we need to fit into the concrete implementations, and so BufferLedger is unavoidable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did look at it a little bit. But I don't think you can use any implementation of ReferenceManager but BufferLedger if you want everything else to work. It does feel bad to have effectively two redundant reference counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand. It was a 'it would be nice if....' comment
@lidavidm I would like to update the tutorial docs for memory management after this is implemented to make sure I fully understand it all, assuming that works for you. |
That would be much appreciated, thank you. It would be good if we could all check our understanding. |
FWIW, I think this is ready. Maybe @emkornfield and/or @zhztheplayer would also like to take a look and see how the proposed solution works? |
@emkornfield and/or @zhztheplayer, do you think you will have time to consider this? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
BTW In future maybe we can consider integrating AllocationListener
into this allocation path too?
UnsafeForeignAllocationManager allocationManager = new UnsafeForeignAllocationManager(allocator, bufferSize); | ||
try { | ||
assertEquals(0, allocator.getAllocatedMemory()); | ||
ArrowBuf buf = allocator.wrapForeignAllocation(allocationManager); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What sort of optimization are you thinking?
It would be nice to avoid the double reference count but I think that will require some more intricate surgery on the allocator internals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What sort of optimization are you thinking?
To make AllocationManager creatable without an allocator, e.g.
UnsafeForeignAllocationManager allocationManager = new UnsafeForeignAllocationManager(bufferSize);
ArrowBuf buf = allocator.wrapForeignAllocation(allocationManager);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to avoid the double reference count but I think that will require some more intricate surgery on the allocator internals.
(Sorry for asking more on an already approved PR, indeed these are all non-blocker questions)
Regarding the double reference counting, I am not sure about why it's actually needed? In the previous PR we didn't actually touch anything about ref counting and things went well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't actually implement the API. It just suppressed an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new AllocationManager(allocator)
creates an zero ref count owning ledger. I guess that ledger may lead to unexpected behavior when user call wrapForeignAllocation(...)
from another allocator. Maybe decoupling allocator from AllocationManager's constructor (not the manager itself) will be a solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I don't see what the issue is. That ledger is the point, it's what links the allocation to the allocator.
You shouldn't be wrapping the same allocation twice. Not sure we can really do anything about that, this is inherently a somewhat unsafe API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered something like:
BufferAllocator child1 = ...;
BufferAllocator child2 = ...;
ForeignAllocationManager allocationManager = new ForeignAllocationManager(child1, size, memoryAddress);
ArrowBuf buf = child2.wrapForeignAllocation(allocationManager);
It looks like optimizable if we tell user it's a must to instantly call wrapForeignAllocation
on same allocator after creating the AM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, thanks. I think I see now. I'll give that a test, and I think I might change it so that the ForeignAllocation is only a release callback, pointer, and size. That way the details can all be properly encapsulated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the latest commit.
* <p>There is a fundamental mismatch here between memory allocation schemes: AllocationManager represents a single | ||
* allocation (= a single address and length). But an ArrowArray combines multiple allocations behind a single | ||
* deallocation callback. This class bridges the two by tracking a reference count, so that the single callback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But an ArrowArray combines multiple allocations behind a single deallocation callback.
Wasn't the deallocation of an ArrowArray always follow the postorder-traversal of the array tree? Or maybe I just got something wrong about the deallocation process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order does not matter here. It is about one vs many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I got it after going through the code. Thought the comment is related to the struct array -> child array case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok yeah. I removed the original comment about the struct/child array case since it is just not possible to separate the child allocations from the parent, unfortunately.
39a7f29
to
aa8c22e
Compare
/** | ||
* An allocation of memory that does not come from a BufferAllocator, but rather an outside source (like JNI). | ||
*/ | ||
public abstract class ForeignAllocation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@lidavidm Is it possible to merge this PR today? |
Given that it's touching some core APIs, I'd like to see some review…I can go and make it clear that the APIs added here are experimental, at least for now |
IDK if you saw @zhztheplayer's approval from last night. I've been following along, but LMK if you want me to formally approve the latest changes. |
Yes, and it's appreciated; I was hoping @emkornfield might also be able to comment on the proposed design, though. It shouldn't hold us back, I just want to go through and make it clear where the new APIs are experimental or meant for internal use. |
aa8c22e
to
92a7fc8
Compare
I'll merge this for now and we can revisit the design if issues come up. |
Benchmark runs are scheduled for baseline = b1f65ea and contender = 7a47e8d. 7a47e8d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR bumps Apache Arrow version from 10.0.0 to 11.0.0. Main changes related to PyAmber: ## Java/Scala side: - Distribute Apple M1 compatible JNI libraries via mavencentral ([#14472](apache/arrow#14472)). - Improve performance by short-circuiting null checks when comparing non null field types ([#15106](apache/arrow#15106)). - Extend Table copy functionality, and support returning copies of individual vectors ([#14389](apache/arrow#14389)). - Several enhancements to dictionary encoding ([#14891](apache/arrow#14891), ([#14902](apache/arrow#14902), ([#14874](apache/arrow#14874)). - Extend Table to support additional vector types ([#14573](apache/arrow#14573)). - Enhance and simplify handling of allocation management by integrating C Data into allocator hierarchy ([#14506](apache/arrow#14506)). ## Python side: - PyArrow now requires pandas >= 1.0 ([ARROW-18173](https://issues.apache.org/jira/browse/ARROW-18173)). - Added support for the [DataFrame Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html) for pyarrow.Table ([GH-33346](apache/arrow#33346)). - Support for custom metadata of record batches in the IPC read and write APIs ([ARROW-16430](https://issues.apache.org/jira/browse/ARROW-16430)). - The Time32Scalar, Time64Scalar, Date32Scalar and Date64Scalar classes got a .value attribute to access the underlying integer value, similar to the other date-time related scalars ([ARROW-18264](https://issues.apache.org/jira/browse/ARROW-18264)). - Casting to string is now supported for duration ([ARROW-15822](https://issues.apache.org/jira/browse/ARROW-15822)) and decimal ([ARROW-17458](https://issues.apache.org/jira/browse/ARROW-17458)) types, which also means those can now be written to CSV. ## Issues fixed: - Now Do_action (from Python server back to Java Client) is returning a stream of results properly, and it alerts when the results are not fully consumed by the client. Such results will be used to send the flow control credits back from the Python side. We limit the results to be exact 1 for now, although it can be a stream. - Fix a bug in the Python proxy server, when unregistered action is invoked, it should not parse and return the results.