Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API discussion: keeping duplicated methods xyz/xyz_async vs having an async kwarg? #2598

Open
bsipocz opened this issue Nov 23, 2022 · 8 comments

Comments

@bsipocz
Copy link
Member

bsipocz commented Nov 23, 2022

While I was reviewing #2597, I run into this quasi-dilemma, but I raise it as a separate question for @keflavich and @andamian, as well as @ceb8.

We have a lot of historical baggage around our API, and I would like to revisit some of it to see whether they still make sense, or we should slowly deprecate them out. At some point we need to switch versioning anyway, so one big breath and breaking API doesn't feel off the table.

So, the question for async and sync jobs: we traditionally have method_async and method, auto-generated with async_to_sync. But some modules don't follow this, and some modules manually have duplicated methods. So I wonder, now as we move on towards using proper VO tools, and pyvo, whether we want to revisit our API preferences, and would rather go with this kwarg-driven approach? Or the method duplication is better? (And frankly, it's baked into the API too much already that if it's the same, we could just stick with it).

I very much interested in the take from Adrian, whether it made any sense to you while redoing e.g. alma, or it was something to go along with?

@andamian
Copy link

alma has only sync access but cadc has both. The magic of async_to_sync is more trouble than it's worth it. I just realized some mistakes I've made in cadc now that I'm specifically looking at this.

The sync case is simple so I'm not going to discuss it. For the async one, the main difference is that the control returns to the client while the request is being processed but if followed back (and successful) it can return same/similar result. To me, this looks very similar to the stream attribute in requests functions: by default query_xyz should return the results (whether behind the scene is sync or async) but if the user wants more control and service supports async requests, it could return from the method before the request is completed and let user take over (poll the service and access results when done).

With async requests, there's also the question of the request ID. In IVOA, services return a job with an ID. With that job ID, the user can check the status of the job and access the results when job is completed. astroquery might need to recommend a function that takes a job ID and returns job details data associated with it that can be executed any time. In cadc, we expose the pyvo.Job class which works with VO services but probably not with others.

To summarize, if backwards compatibility is not an issue I would get rid of async_to_sync and add the optional async parameter to methods that use async services. Mind you, my experience is mainly with VO services (which BTW could have a default implementation to achieve that.) and I don't know if it works with others.

@keflavich
Copy link
Contributor

Using a kwarg instead of the async-to-sync magic makes sense to me, so if we want to start overhauling services, we can do that.

My vague sense is that the _async methods have not been widely used outside of development and debugging.

We should do a little research on this.

https://github.com/search?q=query_region_async&type=Code turns up almost all hits to astroquery tests in various forks.

query_region is harder to search for, but
https://github.com/search?q=simbad.query_region&type=Code
turns up several user use-cases, while https://github.com/search?p=2&q=simbad.query_region_async&type=Code is just astroquery forks.

So if we want to make a change, let's spell out the proposed API here and discuss further. I'm hearing: remove _async methods, and instead add async=True/False. For most services, the return will still be a requests response object. Is that correct?

@andamian
Copy link

andamian commented Nov 25, 2022

Yes. Missing async or async=False will still return a request.Response. The question is what async=True should return?

For VO services, as I've said, it could return a UWS job. The pattern is well documented.

Do we have any idea/examples of async non-VO services? I just assume that they return a URL where the data is being staged. The URL is inaccessible until data becomes available. At least this is the pattern that ALMA used to have if I remember correctly. If that's the case, the simplest is to assume that this is what the return should be - a response with a redirect URL that the client needs to check for when the results become available. We'll need to figure out if that's possible to achieve with VO services too. With these services we might want to return more details about the jobs for clients that could make use of that info.

@keflavich
Copy link
Contributor

No, if we're talking about a unified method, async=False returns the parsed table, not a response object.

The original idea for async was to enable asynchronous/parallel downloads from normal HTTP web services, not from VO. We only later encountered services that actually had separate query & staging processes, and those were not handled through the async mechanism. besancon is one example, iirc.

@ceb8
Copy link
Member

ceb8 commented Nov 28, 2022

@keflavich My experience (primarily with MAST, but I've been thinking about the construction while working on the new NEOCC service too) matches with yours in that the async versions have been primarily useful for debugging for modules that don't have a truly asynchronous option.

@bsipocz
Copy link
Member Author

bsipocz commented Sep 27, 2023

@andamian:

Yet another occasion when I'm running into the async vs sync question:

  • at least in the TAP cases, shouldn't we default to run an async search instead, and maybe expose sync to the users? Currently we default to pyvo's default, which looks to be sync.

@andamian
Copy link

Not an easy choice, especially with TAP. sync operations are those that return fast (2min is probably a good rule of thumb). More than that and the connection might time out. The async mode with that in mind.

For TAP, the duration of the query is not deterministic, so async might be a safer bet. Besides increased the increased complexity of dealing with jobs, there are other drawbacks: the response time (results are only ready when the job completes and data is staged) and possible restrictions on the size of the staged data.

The client can also implement a hybrid mode, where the API is sync, but underneath uses the service async endpoint to start jobs and periodically check their status until they complete. In this mode they do not have to deal with the jobs.

Does this help at all?

@bsipocz
Copy link
Member Author

bsipocz commented Sep 27, 2023

OK, gotcha, this helps a lot, thank you.
So, based on the archive preference, we may have different defaults whether we default to sync or async (e.g. we may want to do async for irsa). However, I think there is a fundamental agreement that we should move away from duplicating methods (let alone when they are not even real async ones), in favor of a kwarg.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants