Expose a static object store registry #1072
@yahoNanJing @alamb @houqp I created this PR as I noticed that this change was proposed as part of #1062 and I was also reaching a similar conclusion as part of rdettai#1. I think it is worth discussing it separately, as this solution comes with drawbacks.
Could you elaborate on the root cause for requiring this static? I was unable to reason about
The problem is mainly present in Ballista:
I am not really happy about this solution either, so any alternative solution is welcome!
I think the alternative would be to thread the ObjectStoreRegistry on some structure down to all the places it was needed (e.g. perhaps the Basically I don't see the ObjectStoreRegistry as something that is "owned" by the |
Interesting, from the discussion we had in rdettai#1, I got the opposite impression on where we are heading ;P I thought the conclusion from that discussion was:
With this in mind, do we still need a global singleton registry? The object store registry is already attached to ExecutionContext today, so it should be good enough, right?
I opened this PR to formalize the discussion, sorry if it brought in more confusion than clarity 😃. Let me try to summarize the full context.
If I understand correctly, @houqp is leaning toward solution (2). Even though it lacks flexibility, it uses the same mechanism Ballista is already using for table providers, so I would also go with it. So with solution (2) the Ballista flow would be:
I like this enumeration of possibilities (1) and (2). 👍 I think solution (2) is also a reasonable approach. Another thought I had if we want to do (1) without a static Not the cleanest solution, but it would avoid trying to plumb something into the serde serialization |
Thank you all for your input! Closing this as I feel we have discarded the implementation with |
Thanks @rdettai for the detailed write up! I think going static should be good enough to unblock our development in the short run. But I agree with you that this is just a short term workaround. To make object stores truly pluggable, we need to serialize them into unique values, e.g. the URI scheme, stored in generic strings instead of an enum in protobuf. This way, if a user compiles in a new object store using a custom crate, they can still get it to work without having to change the Ballista protobuf file. In fact, I think we need to do the same thing for table providers as well: hardcoding table providers in protobuf leads to the same restriction. For example, it's not possible to use delta-rs's custom table provider with Ballista at the moment. Given that the current logical plan deserialization code only takes the serialized protobuf plan as input, I think we would have to go with the lazy two-pass deserialization approach proposed by @alamb. Alternatively, we can change the deserialization call in the scheduler to pass in the execution context. This will make it a lot easier to implement more dynamic deserialization logic for both object stores and table providers. I don't see a strong reason why we would want to avoid referencing the execution context during logical plan deserialization.
This is a good point and makes sense to me.
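To make the scheme-based idea concrete, here is a minimal sketch of a registry keyed by URI scheme strings rather than by a protobuf enum. Everything in it is an assumption made for illustration: the `ObjectStore` trait is a stand-in rather than DataFusion's real trait, and `SchemeRegistry`, `register`, `resolve`, and the fallback to `file` for scheme-less paths are hypothetical names and choices.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for an object store trait (not DataFusion's real one).
pub trait ObjectStore: Send + Sync {
    fn name(&self) -> &str;
}

/// Toy implementation, used only to have something to register.
pub struct LocalFileSystem;

impl ObjectStore for LocalFileSystem {
    fn name(&self) -> &str {
        "local"
    }
}

/// Hypothetical registry keyed by URI scheme. Because the key is a plain
/// string, a store compiled in from a custom crate needs no protobuf change.
#[derive(Default)]
pub struct SchemeRegistry {
    stores: HashMap<String, Arc<dyn ObjectStore>>,
}

impl SchemeRegistry {
    /// Register a store under a scheme; returns the store it replaced, if any.
    pub fn register(
        &mut self,
        scheme: &str,
        store: Arc<dyn ObjectStore>,
    ) -> Option<Arc<dyn ObjectStore>> {
        self.stores.insert(scheme.to_ascii_lowercase(), store)
    }

    /// Resolve the store for a URI. Assumption: a URI without "://" is
    /// treated as a local path and looked up under the "file" scheme.
    pub fn resolve(&self, uri: &str) -> Result<Arc<dyn ObjectStore>, String> {
        let scheme = match uri.split_once("://") {
            Some((scheme, _)) => scheme.to_ascii_lowercase(),
            None => "file".to_string(),
        };
        self.stores
            .get(&scheme)
            .cloned()
            .ok_or_else(|| format!("no object store registered for scheme '{}'", scheme))
    }
}
```

With this shape, a serialized plan only needs to carry the URI string; each scheduler or executor resolves it against whatever stores were registered locally.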
Likewise, I don't see any reason to avoid referencing |
The context contains a lot of different configurations, part of which will be copied into the logical plan, part of which won't. So it seems to me that it will be kind of hard to figure out which configurations need to be consistently set across the different execution contexts across the cluster (either through the boot time config or through serialization along the query), and which are only needed on the node where the plan is created. I guess that in that case, the context should be structured into multiple tiers:
I think splitting execution context config into different tiers is a good idea to make it more maintainable. Then we can pass only the static tier config to the plan deserialization code to make it deterministic.
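One possible reading of the tiered layout, sketched under assumptions: the tier names (`StaticTier`, `SessionTier`), their fields, and `deserialize_scan` are all hypothetical and do not come from DataFusion or Ballista. The point is only that the deserialization function's signature accepts the static tier alone, so its result cannot depend on per-session settings.

```rust
/// Hypothetical "static" tier: boot-time settings that are expected to be
/// identical on every node of the cluster.
#[derive(Debug, Clone, PartialEq)]
pub struct StaticTier {
    pub object_store_schemes: Vec<String>,
}

/// Hypothetical per-session tier: only meaningful where the plan is created.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionTier {
    pub batch_size: usize,
}

/// Hypothetical context config split into the two tiers above.
pub struct TieredConfig {
    pub static_tier: StaticTier,
    pub session_tier: SessionTier,
}

/// Stand-in for a serialized scan node (real plans are protobuf messages).
pub struct SerializedScan {
    pub uri: String,
}

/// Deserialization sees only the static tier. Returns the scheme that the
/// scan resolved to, or an error if no store is configured for it.
pub fn deserialize_scan(node: &SerializedScan, config: &StaticTier) -> Result<String, String> {
    // Assumption: a URI without "://" is a local path under the "file" scheme.
    let scheme = match node.uri.split_once("://") {
        Some((scheme, _)) => scheme,
        None => "file",
    };
    if config.object_store_schemes.iter().any(|s| s.as_str() == scheme) {
        Ok(scheme.to_string())
    } else {
        Err(format!("scheme '{}' is not configured in the static tier", scheme))
    }
}
```

A caller would hold a full `TieredConfig` but pass only `&config.static_tier` to the deserializer.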
Which issue does this PR close?
This is linked to #1010 #1062
Rationale for this change
Currently, we don't have a good way to make the ObjectStoreRegistry available in various places of the code. Indeed, to set up its object stores, the registry needs to have access to the implementation code (which might be in another crate) at compile time and needs to have access to the proper configurations (such as credentials) when initialized. In particular in Ballista, we need to have the right registry configuration on the Executors and the Scheduler.
Creating a singleton static object store registry is not ideal (it comes with the usual curses that go with static), but solves (at least temporarily) some of the issues:
- OBJECT_STORES initialization, that will propagate the configuration to all the places where the registry is required (local, scheduler, executor)
- OBJECT_STORES can be accessed in any place of the code: logical plan or physical plan deserialization, provider creation...
What changes are included in this PR?
ExecutionContextState
ObjectStoreRegistry called OBJECT_STORES
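For readers unfamiliar with the pattern, a process-wide registry of this kind could look roughly like the sketch below. It is not the code from this PR: the `ObjectStore` trait and `InMemoryStore` are stand-ins, the method names `register_store` and `get` and the accessor `object_stores()` are assumptions, and `std::sync::OnceLock` is simply one standard-library way to get a lazily initialized static.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

/// Hypothetical stand-in for an object store trait.
pub trait ObjectStore: Send + Sync {
    fn name(&self) -> &str;
}

/// Toy store used only to demonstrate registration.
pub struct InMemoryStore;

impl ObjectStore for InMemoryStore {
    fn name(&self) -> &str {
        "mem"
    }
}

/// Registry with interior mutability so it can live behind a shared static.
pub struct ObjectStoreRegistry {
    stores: RwLock<HashMap<String, Arc<dyn ObjectStore>>>,
}

impl ObjectStoreRegistry {
    fn new() -> Self {
        ObjectStoreRegistry {
            stores: RwLock::new(HashMap::new()),
        }
    }

    /// Register a store under a scheme, replacing any previous entry.
    pub fn register_store(&self, scheme: &str, store: Arc<dyn ObjectStore>) {
        self.stores
            .write()
            .expect("registry lock poisoned")
            .insert(scheme.to_string(), store);
    }

    /// Look up the store registered for a scheme, if any.
    pub fn get(&self, scheme: &str) -> Option<Arc<dyn ObjectStore>> {
        self.stores
            .read()
            .expect("registry lock poisoned")
            .get(scheme)
            .cloned()
    }
}

/// The process-wide singleton, created on first access.
static OBJECT_STORES: OnceLock<ObjectStoreRegistry> = OnceLock::new();

/// Accessor usable from anywhere: plan deserialization, provider creation...
pub fn object_stores() -> &'static ObjectStoreRegistry {
    OBJECT_STORES.get_or_init(ObjectStoreRegistry::new)
}
```

The usual drawbacks of statics apply to this sketch as well: every test and every context in the same process shares one mutable registry.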
Are there any user-facing changes?
ExecutionContextState changes the config API but that part was likely not used yet