
Transport: AsyncTransport plugin #6626

Open — wants to merge 2 commits into base: main
Conversation

khsrali
Contributor

@khsrali khsrali commented Nov 21, 2024

This PR proposes many changes to make transport tasks asynchronous, ensuring the daemon won't be blocked by time-consuming tasks such as uploads, downloads, and similar operations, as requested by @giovannipizzi.

Here’s a summary of the main updates:

  • New Transport Plugin: Introduces AsyncSshTransport with the entry point core.ssh_async.
  • Enhanced Authentication: AsyncSshTransport supports executing custom scripts before connections, which is particularly useful for authentication. 🥇
  • Engine Updates: Modifies the engine to consistently call asynchronous transport methods.
  • Deprecated Methods: Deprecates the use of transport.chdir() and transport.getcwd() (merged in Transport & Engine: factor out getcwd() & chdir() for compatibility with upcoming async transport #6594).
  • Backward Compatibility: Provides synchronous counterparts for all asynchronous methods in AsyncSshTransport.
  • Transport Class Overhaul: Deprecates the previous Transport class. Introduces _BaseTransport, Transport, and AsyncTransport as replacements.
  • Improved Documentation: Adds more docstrings and comments to guide plugin developers. Blocking plugins should inherit from Transport, while asynchronous ones should inherit from AsyncTransport.
  • Updated Tests: Revises test_all_plugins.py to reflect these changes. Unfortunately, existing tests for transport plugins remain minimal and need improvement in a separate PR (TODO).
  • New Path Type: Defines a TransportPath type and upgrades transport plugins to work with Union[str, Path, PurePosixPath].
  • New Feature: Introduces copy_from_remote_to_remote_async, addressing a previous issue where such tasks blocked the entire daemon.
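The class layout and the backward-compatible sync counterparts described above can be sketched roughly like this (method names and signatures are illustrative simplifications, not the exact aiida-core API):

```python
import asyncio

class _BaseTransport:
    """Shared machinery; plugin authors should not inherit from this directly."""


class Transport(_BaseTransport):
    """Base class for blocking transport plugins."""

    def put(self, localpath, remotepath):
        raise NotImplementedError


class AsyncTransport(_BaseTransport):
    """Base class for asynchronous transport plugins."""

    async def put_async(self, localpath, remotepath):
        raise NotImplementedError

    def put(self, localpath, remotepath):
        # Synchronous counterpart kept for backward compatibility:
        # drive the coroutine to completion on a fresh event loop.
        return asyncio.run(self.put_async(localpath, remotepath))


class EchoAsyncTransport(AsyncTransport):
    """Toy plugin used only to demonstrate the sync fallback."""

    async def put_async(self, localpath, remotepath):
        await asyncio.sleep(0)  # stand-in for a real asyncssh upload
        return f'{localpath} -> {remotepath}'


result = EchoAsyncTransport().put('in.txt', 'remote/in.txt')
print(result)
```

An async plugin author only implements the `*_async` coroutines; the sync wrapper comes for free from the base class.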

Dependencies: This PR relies on PR 272 in plumpy.


Test Results: Performance Comparisons

When core.ssh_async Outperforms

In scenarios where the daemon is blocked by heavy transfer tasks (uploading/downloading/copying large files), core.ssh_async shows significant improvement.

For example, I submitted two WorkGraphs:

  1. The first handles heavy transfers:
    • Upload 10 MB
    • Remote copy 1 GB
    • Retrieve 1 GB
  2. The second performs a simple shell command: touch file.

The time taken until the submit command is processed (with one daemon running):

  • core.ssh_async: Only 4 seconds! 🚀🚀🚀🚀 A major improvement!
  • core.ssh: 108 seconds (WorkGraph 1 fully completes before processing the second).

When core.ssh_async and core.ssh Are Comparable

For tasks involving both (and many!) uploads and downloads (a common scenario), performance varies slightly depending on the case.

  • Large Files (~1 GB):

    • core.ssh_async performs better due to simultaneous uploads and downloads. In some networks, this can almost double the bandwidth, as demonstrated in the graph below. My bandwidth is 11.8 MB/s but increased to nearly double under favorable conditions:
      [figure: Bandwidth Boost Example]

    • However, under heavy network load, bandwidth may revert to its base level (e.g., 11.8 MB/s):
      [figure: Bandwidth Under Load]

      Test Case: Two WorkGraphs: one uploads 1 GB, the other retrieves 1 GB using RemoteData.

      • core.ssh_async: 120 seconds
      • core.ssh: 204 seconds
  • Small Files (Many Small Transfers):

    • Test Case: 25 WorkGraphs each transferring a few 1 MB files.
      • core.ssh_async: 105 seconds
      • core.ssh: 65 seconds

    In this scenario, the overhead of asynchronous calls seems to outweigh the benefits. We need to discuss the trade-offs and explore possible optimizations. As @agoscinski mentioned, this might be expected; see the linked note on async overheads.

    --- update on 16.01.2025
    Some of these changes have been moved to a separate PR: Engine: Async run #6708


codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 80.15695% with 177 lines in your changes missing coverage. Please review.

Project coverage is 78.01%. Comparing base (c88fc05) to head (b22338e).

Files with missing lines                    | Patch %  | Lines missing
src/aiida/transports/plugins/ssh_async.py   | 73.15%   | 116
src/aiida/transports/transport.py           | 86.33%   | 35
src/aiida/transports/plugins/ssh.py         | 88.10%   | 10
src/aiida/engine/daemon/execmanager.py      | 70.00%   | 9
src/aiida/transports/plugins/local.py       | 92.31%   | 6
src/aiida/transports/util.py                | 50.00%   | 1
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6626      +/-   ##
==========================================
+ Coverage   78.00%   78.01%   +0.02%     
==========================================
  Files         563      564       +1     
  Lines       41766    42504     +738     
==========================================
+ Hits        32574    33154     +580     
- Misses       9192     9350     +158     


@khsrali khsrali marked this pull request as ready for review November 21, 2024 09:11
@khsrali khsrali requested a review from agoscinski November 21, 2024 17:28
Contributor

@agoscinski agoscinski left a comment


Thanks! Looks good, just to reiterate most important comments:


Why don't you just use Transport instead of BlockingTransport, since you set one equal to the other? Now you have redundancy. This API seems clear to me:

_BaseTransport -> Transport -> SshTransport
_BaseTransport -> AsyncTransport -> AsyncSshTransport

Will you make a PR in plumpy there so we can do a new release?


Tests I will review in the separate PR

@@ -119,7 +120,7 @@ pillow==10.1.0
platformdirs==3.11.0
plotly==5.17.0
pluggy==1.3.0
plumpy==0.22.3
plumpy@git+https://github.com/khsrali/plumpy.git@allow-async-upload-download#egg=plumpy
Contributor

Will you make a PR there so we can do a new release?

Contributor Author

yes! Please review here: aiidateam/plumpy#272

if (
canonicalize_name(requirement_abstract.name) == canonicalize_name(requirement_concrete.name)
and abstract_contains
):
Contributor

Do we remove this before merge? Otherwise it would be good to add some comment what the new if-else does. Hard to understand without context

Contributor Author

I plan to keep it, as it's very useful for passing CI when we make PRs like this that are hooked to another PR or to a branch of another repo with @.

The problem is that @ is not listed as a valid specifier in the Specifier class.
This small change accepts @ as a valid specifier and checks that a hooked dependency points to the same "version" across all files (requirements-xx, environment.yml, etc.).

This way, besides this nice check, even though the dependency test fails, it still triggers the main unit tests (test-presto, test-3.xx) for such PRs (otherwise it wouldn't).

Contributor Author

I added a few lines of comment to clarify this

Contributor

Thanks!

Collaborator

This is nice, perhaps would be better to separate into standalone PR for visibility.

btw: I started looking into using uv lockfile in #6640, seems like a better strategy than having to wrangle 4 different requirements files. :-)

Contributor Author

As we discussed, this feature is already covered in the new PR #6640.
So I'll keep the changes temporarily for this PR only, and will revert utils/dependency_management.py before any merge.

return str(path)


class _BaseTransport:
Contributor

Isn't this part of public API? I should use it if I create a new transport plugin? Or should I use Transport?

Contributor Author

No, this is private. No one should inherit from this except AsyncTransport and BlockingTransport.
Only AsyncTransport and BlockingTransport are the public ones, to be used for creating a new plugin.

Member

I think it is problematic to have class like this, take the method get_safe_open_interval as example.

    def get_safe_open_interval(self):
        """Get an interval (in seconds) that suggests how long the user should wait
        between consecutive calls to open the transport.  This can be used as
        a way to get the user to not swamp a limited number of connections, etc.
        However it is just advisory.
        If returns 0, it is taken that there are no reasons to limit the
        frequency of open calls.
        In the main class, it returns a default value (>0 for safety), set in
        the _DEFAULT_SAFE_OPEN_INTERVAL attribute of the class. Plugins should override it.
        :return: The safe interval between calling open, in seconds
        :rtype: float
        """
        return self._safe_open_interval

It says "Plugins should override it", so what is the point of defining the method?



# This is here for backwards compatibility
Transport = BlockingTransport
Contributor

I don't know if it makes sense to make the blocking one the default, especially if you expose both of them in the API. Shouldn't there be a public class for blocking and non-blocking transport from which one should inherit?

Contributor Author

This was just for backward compatibility, as Giovanni suggested renaming the former blocking Transport to BlockingTransport.

@@ -164,7 +167,8 @@ def test_upload_local_copy_list(
calc_info.local_copy_list = [[folder.uuid] + local_copy_list]

with node.computer.get_transport() as transport:
execmanager.upload_calculation(node, transport, calc_info, fixture_sandbox)
runner = get_manager().get_runner()
runner.loop.run_until_complete(execmanager.upload_calculation(node, transport, calc_info, fixture_sandbox))
Contributor

why is this needed now?

Contributor Author

Because execmanager.upload_calculation is now an async function; this way we can call it from a sync test.
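The pattern can be reduced to a minimal sketch (the coroutine below is a hypothetical stand-in for the now-async execmanager function, not its real signature):

```python
import asyncio

async def upload_calculation(node, transport, calc_info, folder):
    """Hypothetical stand-in for the now-async execmanager function."""
    await asyncio.sleep(0)  # pretend to do async I/O
    return f'uploaded {node}'

# In a synchronous test there is no running event loop, so the coroutine
# is driven to completion explicitly, mirroring
# runner.loop.run_until_complete(execmanager.upload_calculation(...)):
loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(upload_calculation('node-123', None, None, None))
finally:
    loop.close()
print(result)
```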

Contributor

What happens if you use the old way? The test just passes and continues before finishing the command?

Member

I think it is very tricky to mix async programming and sync functions; it is in general a very hard problem. It looks to me like runner.loop.run_until_complete will block the running of the task until it completes, so there is no benefit after making these methods async. Is create_task the correct thing to use here?

Member

Okay, I just asked Ali offline. This is only for tests, and only to test that the functionality of the implementation is correct. The async behavior of the four operations working together is not the purpose here.

@unkcpz
Member

unkcpz commented Nov 24, 2024

I am about to finish #6627, which I think can benefit the tests here as well. Please hold on a bit for that. I'll try my best to get that one merged by Wednesday.

@khsrali
Contributor Author

khsrali commented Nov 25, 2024

Why don't you just use Transport instead of BlockingTransport, since you set one equal to the other? Now you have redundancy. This API seems clear to me:

_BaseTransport -> Transport -> SshTransport
_BaseTransport -> AsyncTransport -> AsyncSshTransport

I just followed what @giovannipizzi suggested. But agreed, this makes more sense, so I'm going to apply these changes.

Will you make a PR in plumpy there so we can do a new release?

Will do once my performance tests are ready..

@khsrali
Contributor Author

khsrali commented Nov 25, 2024

Note to myself:
@danielhollas suggested we apply the changes directly to core.ssh rather than creating a new plugin core.ssh_async.
I should investigate this.

if (
canonicalize_name(requirement_abstract.name) == canonicalize_name(requirement_concrete.name)
and abstract_contains
):
Contributor

Thanks!

Contributor

@agoscinski agoscinski left a comment


some minor changes

@khsrali
Contributor Author

khsrali commented Dec 5, 2024

Note:
tests are failing due to this issue aiidateam/plumpy#294

@khsrali
Contributor Author

khsrali commented Dec 5, 2024

Checklist:

  • To think about whether unifying core.ssh with core.ssh_async (and even core.ssh_auto) is possible, and if so, whether that should be done here or preferably in a separate PR.
  • Finalize and report the performance tests.
  • Merge ♻️ Make Process.run async plumpy#272 and release

@unkcpz
Member

unkcpz commented Dec 5, 2024

tests are failing due to this issue aiidateam/plumpy#294

Hi @khsrali, I merged #6640, so it should work now, I guess. Can you resolve the conflict and try again? Thanks.

@khsrali
Contributor Author

khsrali commented Dec 5, 2024

Hi @khsrali, I merged #6640, so it should work now, I guess. Can you resolve the conflict and try again? Thanks.

Thanks @unkcpz, now I face issues I never had before, lol:

error: The lockfile at `uv.lock` needs to be updated, but `--locked` was provided. To update the lockfile, run `uv lock`.

Actually, I even tried to update the file using uv lock; it still won't pass.

@agoscinski
Contributor

Actually, I even tried to update the file using uv lock; it still won't pass.

Sorry for the experience. We are now trying out uv for dependency management and installation. uv is a really useful tool, but it is still a bit unstable. For some reason uv lock fails; you can see it when executing it in verbose mode (uv lock -v). I don't know why the full backtrace of the error is meaningless, but what worked for me was to manually add the two packages you changed:

uv add git+https://github.com/aiidateam/plumpy --branch async-run
uv add git+https://github.com/ronf/asyncssh --rev 033ef54302b2b09d496d68ccf39778b9e5fc89e2

I will push the fix now, but I basically only ran these two commands

@khsrali
Contributor Author

khsrali commented Dec 16, 2024

@agoscinski
I'd appreciate it if you could give this PR another round of review (I also asked @unkcpz in the office).

It would be nice to have it merged by the end of this week, because when I come back from holidays
I'll lose half of my memory :-)))

Member

@unkcpz unkcpz left a comment


I gave the implementations a first go. I was only checking test_all_plugins.py the previous time, where I also made changes.
TBH, I think the PR still requires some changes.
I personally think the huge inheritance pattern is the source of a lot of our headaches, and here it adds more of it. Would you mind having a read on

The protocol can fit both sync and async functions, which means AsyncTransport can use the function name without the "_async" suffix. Then inside daemon/execmanager.py, if the function comes from a sync transport, it runs in a blocking manner inside the coroutine; if it is async, it is scheduled on the event loop.

For example:

remote_user = await transport.whoami() # instead of await transport.whoami_async()

In aiidateam/plumpy#272, the post https://textual.textualize.io/blog/2023/03/15/no-async-async-with-python/ was mentioned. For the transport, I think the idea can work well: async usage under the hood, while remaining callable as sync functions.

But anyway, these are more stylistic requests on my side. I think the PR is a great effort to improve performance with async ssh. I think @khsrali already did the most difficult part: understanding the async behavior and building the benchmark workflow to prove the changes are correct. We can do some pair coding next year to also settle the interface and the stylistic disagreements.
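The idea of a single call site serving both transport flavours can be sketched like this (class and method names are illustrative, not the aiida-core API):

```python
import asyncio
import inspect

class SyncTransport:
    """Toy blocking transport."""

    def whoami(self):
        return 'alice'

class AsyncTransportDemo:
    """Toy async transport using the same method name, no '_async' suffix."""

    async def whoami(self):
        await asyncio.sleep(0)
        return 'bob'

async def remote_user(transport):
    # One call site for both flavours: await only if the transport
    # returned an awaitable; otherwise use the sync result directly.
    result = transport.whoami()
    if inspect.isawaitable(result):
        result = await result
    return result

async def main():
    return await remote_user(SyncTransport()), await remote_user(AsyncTransportDemo())

users = asyncio.run(main())
print(users)
```

With this dispatch, execmanager-style code can write `await remote_user(transport)` once, regardless of which plugin family the computer is configured with.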

@@ -192,7 +192,7 @@ def _get_submit_command(self, submit_script):
directory.
IMPORTANT: submit_script should be already escaped.
"""
submit_command = f'bash {submit_script} > /dev/null 2>&1 & echo $!'
submit_command = f'(bash {submit_script} > /dev/null 2>&1 & echo $!) &'
Member

I don't think this change is related, can you move it to another PR?

Contributor Author

Command execution in the asyncssh library required this change, otherwise the command will not be awaited; therefore this change is related to this PR.
I've checked this change and it has no effect on the expected behavior of command execution in paramiko, so everything is safe.

Member

Can you write some comments on why this change does not affect the behavior of bash?
If I have a scheduler that is not direct but still runs the bash command, will it conflict with asyncssh?

Contributor

The only thing I can see happening is that the PID printed with echo $! is now printed after the next command, because it runs concurrently. This could be critical if we relied on the printed PID: we would want to read the PID from the echo command but could get a different output. But as far as I checked, we do not rely on the printed PID; we retrieve it using a long ps command (see _get_joblist_command).

The gist is: I don't think it interferes.
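The effect of the extra subshell can be reproduced locally with asyncio's subprocess support, as an analogy for how asyncssh waits for the command's output channel to reach EOF (this is an assumption about the mechanism, and the demo assumes a POSIX shell):

```python
import asyncio
import time

async def timed_shell(cmd: str) -> float:
    """Run cmd in a shell and return how long it took to see EOF on stdout."""
    start = time.monotonic()
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    await proc.communicate()  # returns only once stdout reaches EOF
    return time.monotonic() - start

async def main():
    # The background job inherits the shell's stdout, so the pipe stays
    # open until `sleep` exits: roughly two seconds.
    held = await timed_shell('sleep 2 & echo $!')
    # With the job's streams redirected and the subshell itself
    # backgrounded (the shape of the new submit command), EOF arrives
    # as soon as `echo` has run.
    detached = await timed_shell('(sleep 2 > /dev/null 2>&1 & echo $!) &')
    return held, detached

held, detached = asyncio.run(main())
print(f'held pipe: {held:.1f}s, detached: {detached:.1f}s')
```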

__all__ = ('Transport',)
__all__ = ('AsyncTransport', 'Transport', 'TransportPath')

TransportPath = Union[str, Path, PurePosixPath]
Member

To deal with generic path typing, it is better to cover more, I think:

PathLike = Union[AnyStr, os.PathLike]

Inside the functions, I'd rather use pathlib.Path everywhere instead of str. The reason is that we are all moving to pathlib.Path in other modules across the code base.

Contributor Author

Thanks for your suggestion.
str still has to be supported, because there are plugins that call transport methods directly with str paths. For example, in QE there exist one or two such calls; other plugins I have not checked.

As for covering more types, I'd suggest we do that when a concrete use case shows up.
AnyStr also includes bytes, which I believe we don't need.
os.PathLike is very inclusive and allows custom path types; although I agree it's nice, I don't see why we would need that right now.

I defined it that way to be very specific about which paths we support.
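A minimal sketch of the alias and the kind of normalization helper the plugins can rely on (the helper name is illustrative, not part of the PR):

```python
from pathlib import Path, PurePosixPath
from typing import Union

# The deliberately narrow alias discussed above.
TransportPath = Union[str, Path, PurePosixPath]

def path_to_str(path: TransportPath) -> str:
    """Normalize any accepted path flavour to the plain string that the
    transport ultimately hands to the remote side."""
    return str(path)

examples = [
    path_to_str('work/run'),
    path_to_str(PurePosixPath('work') / 'run'),
]
print(examples)
```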

Member

The type annotation change is not related to the main change here of adding the AsyncTransport plugin. Can you make it an independent PR or commit, to make it easy to see which changes are actually required for adding the async transport?

Contributor

A separate PR would be nicer, but at least mention this in the commit message if not.

@unkcpz
Member

unkcpz commented Jan 10, 2025

Please be aware that the failing py3.10 test may be caused by the changes in this PR.

aiida_code_installed = <function aiida_code_installed.<locals>.factory at 0x7f624c74edd0>

    def test_get_builder_restart(aiida_code_installed):
        """Test :meth:`aiida.orm.nodes.process.process.ProcessNode.get_builder_restart`."""
        inputs = {
            'code': aiida_code_installed(default_calc_job_plugin='core.arithmetic.add', filepath_executable='/bin/bash'),
            'x': Int(1),
            'y': Int(1),
            'metadata': {'options': {'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 1}}},
        }
>       _, node = launch.run_get_node(ArithmeticAddCalculation, inputs)

tests/orm/nodes/process/test_process.py:88: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/aiida/engine/launch.py:65: in run_get_node
    return runner.run_get_node(process, inputs, **kwargs)
src/aiida/engine/runners.py:291: in run_get_node
    result, node = self._run(process, inputs, **kwargs)
src/aiida/engine/runners.py:261: in _run
    process_inited.execute()
.venv/lib/python3.10/site-packages/plumpy/processes.py:88: in func_wrapper
    return func(self, *args, **kwargs)
.venv/lib/python3.10/site-packages/plumpy/processes.py:1200: in execute
    self.loop.run_until_complete(self.step_until_terminated())
.venv/lib/python3.10/site-packages/nest_asyncio.py:92: in run_until_complete
    self._run_once()
.venv/lib/python3.10/site-packages/nest_asyncio.py:115: in _run_once
    event_list = self._selector.select(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <selectors.EpollSelector object at 0x7f6244131de0>, timeout = 100.0

    def select(self, timeout=None):
        if timeout is None:
            timeout = -1
        elif timeout <= 0:
            timeout = 0
        else:
            # epoll_wait() has a resolution of 1 millisecond, round away
            # from zero to wait *at least* timeout seconds.
            timeout = math.ceil(timeout * 1e3) * 1e-3
    
        # epoll_wait() expects `maxevents` to be greater than zero;
        # we want to make sure that `select()` can be called when no
        # FD is registered.
        max_ev = max(len(self._fd_to_key), 1)
    
        ready = []
        try:
>           fd_event_list = self._selector.poll(timeout, max_ev)
E           Failed: Timeout >240.0s

/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/selectors.py:469: Failed

It says the event loop is closed. It might be caused by using loop.get_event_loop() in the transport and running coroutines with run_until_complete, which may then close even the main event loop. It requires more investigation.

@khsrali
Contributor Author

khsrali commented Jan 10, 2025

Hi @unkcpz, yes, they seem to be flaky. aiida_profile_clean would be nice, but which ones were failing?

Edit: haha, now they pass, lol

@agoscinski
Contributor

It might work because the timeout is beyond the total testing time (set to 40 minutes now), so no test is killed by the timeout. It might be because the signal method of pytest-timeout is used.

If the system supports the SIGALRM signal the signal method will be used by default [...]
The main issue to look out for with this method is that it may interfere with the code under test. If the code under test uses SIGALRM itself things will go wrong and you will have to choose the thread method.

https://pypi.org/project/pytest-timeout/

Further, when asking a chatbot:

  • Signal handler not being called: if the event loop is not actively yielding, SIGALRM might not be handled.
  • Interrupted system calls: SIGALRM interrupts blocking system calls, causing asyncio tasks to fail with OSError.
  • Conflict with asyncio.run(): asyncio.run() closes the loop automatically, which makes it tricky to use custom signal handlers.

I was not able to verify this statement, but it is easy to test by using thread as the timeout method, i.e. setting timeout_method = thread in the pyproject.toml.
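In pyproject.toml that switch would look roughly like this (a sketch; the timeout value shown is an assumption, only timeout_method is the suggested change):

```toml
[tool.pytest.ini_options]
timeout = 240
timeout_method = "thread"  # default is "signal" where SIGALRM is available
```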

Member

@unkcpz unkcpz left a comment


I gave it another go-through after the discussion on the interface.

I have three major requests:

  1. We need to understand why the code change below is required to make asyncssh work.
(old)
submit_command = f'bash {submit_script} > /dev/null 2>&1 & echo $!'
(new)
submit_command = f'(bash {submit_script} > /dev/null 2>&1 & echo $!) &'

It is a change in the direct scheduler, so what about schedulers that use the same bash command as the direct one: do they all need to be adapted?

  2. After ♻️ Make Process.run async plumpy#272, aiida-core should adjust the run method of processes to be async. I understand it is mainly to support AsyncTransport, but from the usage it is independent of this PR. Since it is a breaking change, it is better separated into a single commit, so it is easier to debug by commit (if something goes wrong in the future, we know whether it is because of the transport or because of the run method change).
    I think it requires rather few changes in the code base: only change upload_calculation to an awaited function and leave the inner function calls unchanged.

  3. The type annotation change to str, Path is not related to this PR. I'd recommend doing it in a separate PR to further discuss whether we need a more generic type, as I recommended with PathLike, or something more specific like Union[Path, str, PurePosixPath].

The PR overall looks super nice, and I think I was convinced that having *_sync is a good API design. Just for the engine part: if possible, separating it out would make this PR transport-only, and we could be more confident moving forward.

@khsrali I know it is a bit frustrating that I still have some requests, and I also understand that moving changes into independent PRs can be tedious. I'd be happy to help. If you don't mind, I can separate out the PRs for the type annotation and the engine change. Or we can do some pair coding and do it together.

@unkcpz
Member

unkcpz commented Jan 13, 2025

I was not able to verify this statement, but it is easy to test by using thread as the timeout method, i.e. setting timeout_method = thread in the pyproject.toml.

Thanks @agoscinski, the docs also say "This (using thread for timeout) is the surest and most portable method.", so let's give it a try.

@unkcpz
Member

unkcpz commented Jan 13, 2025

Hi @unkcpz, yes, they seem to be flaky. aiida_profile_clean would be nice, but which ones were failing?
Edit: haha, now they pass, lol

Yes, I reran them and they passed. Those were from test_restapi, so I think they are not related to the transport behavior of this PR.

@khsrali
Contributor Author

khsrali commented Jan 13, 2025

Dear @unkcpz
Thanks again for your review and concerns. I'll try my best to address your points:

  1. We need to understand why the code change below is required to make asyncssh work.
(old)
submit_command = f'bash {submit_script} > /dev/null 2>&1 & echo $!'
(new)
submit_command = f'(bash {submit_script} > /dev/null 2>&1 & echo $!) &'

It is a change in the direct scheduler, so what about schedulers that use the same bash command as the direct one: do they all need to be adapted?

As we discussed in person, this change is required when a computer is set up with core.direct as the scheduler and core.ssh_async as the transport plugin.

The change is compatible with other schedulers, because it submits a script and returns the job id. I do not see why this change would be incompatible with the other scheduler plugins.

  2. After ♻️ Make Process.run async plumpy#272, aiida-core should adjust the run method of processes to be async. Since it is a breaking change, it is better separated into a single commit, so it is easier to debug by commit (if something goes wrong in the future, we know whether it is because of the transport or because of the run method change).

I think this is a good point. I will squash everything into two commits accordingly.

  3. The type annotation changes to str, Path are not related to this PR. I'd recommend doing them in a separate PR to further discuss whether we should use a more generic type such as PathLike, as I recommended, or be more specific with Union[Path, str, PurePosixPath].
    The PR overall looks super nice, and I am convinced that having *_sync is a good API design. As for the engine part, separating it out, if possible, would make this PR transport-only, and we would be more confident moving forward.

@khsrali I know it is a bit frustrating that I still have some requests, and I also understand that moving changes into independent PRs can be tedious. I'd be happy to help: if you don't mind, I can split out the PRs for the type annotation and the engine change, or we can pair up and do it together.

Thank you for the nice words, @unkcpz.
I agree it would have been nice to separate that into another PR from the start; forgive me for that. But now that we are here, I suggest we move on, to save time.
The type annotation is not a crucial or breaking change. Everything comes down to TransportPath, and it can easily be extended without breaking anything.

@khsrali khsrali mentioned this pull request Jan 15, 2025
@khsrali khsrali changed the title Transport & Engine: AsyncTransport plugin Transport: AsyncTransport plugin Jan 16, 2025
@danielhollas (Collaborator) commented:

The performance numbers look awesome?

Test Case: 25 WorkGraphs each transferring a few 1 MB files.
    core.ssh_async: 105 seconds
    core.ssh: 65 seconds

Out of curiosity, which Python version have you been using for the performance measurements?
There have been some performance improvements in the async module, so it would be interesting to test with Python 3.12 and 3.13, to see if the numbers above get better for the small-files case.


khsrali commented Jan 17, 2025

    core.ssh_async: 105 seconds
    core.ssh: 65 seconds

Out of curiosity, which Python version have you been using for the performance measurements? There have been some performance improvements in the async module, so it would be interesting to test with Python 3.12 and 3.13, to see if the numbers above get better for the small-files case.

I used Python 3.12.3 for both cases.
Yes, it might get better with 3.13, but since this is only a factor of two slower, I think it's not a big deal compared to the performance it brings for the other cases :-)

@agoscinski (Contributor) left a comment:

I think we should also mimic test_ssh.py for the async transport. I talked with @khsrali and he will do it in the next PR, since this PR is already quite large.

Could you go through the TODOs that you have added and either open issues for them or resolve them?

@@ -192,7 +192,7 @@ def _get_submit_command(self, submit_script):
directory.
IMPORTANT: submit_script should be already escaped.
"""
- submit_command = f'bash {submit_script} > /dev/null 2>&1 & echo $!'
+ submit_command = f'(bash {submit_script} > /dev/null 2>&1 & echo $!) &'
The only consequence I can see is that the PID printed by echo $! may now appear after the next command's output, because the subshell runs concurrently. This could be critical if we relied on the printed PID: we would want to read the PID from the echo command but could get a different output. But as far as I checked, we do not rely on the printed PID; we retrieve the PID using a long ps command instead (see _get_joblist_command).

The gist is: I don't think it interferes.

return value


class AsyncSshTransport(AsyncTransport):
Have you tried class AsyncSshTransport(AsyncTransport, SshTransport)? That would definitely save a lot of code repetition, but I am not 100% sure whether it results in some weird behaviour.
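For what it's worth, a toy sketch (hypothetical minimal classes, not the real aiida-core implementations) of how Python's MRO would resolve that double inheritance:

```python
# Toy classes reusing the names from the discussion; bodies are illustrative.
class _BaseTransport:
    pass

class AsyncTransport(_BaseTransport):
    async def open(self):
        return 'async open'

class SshTransport(_BaseTransport):
    def open(self):
        return 'sync open'

class AsyncSshTransport(AsyncTransport, SshTransport):
    pass

# C3 linearization puts AsyncTransport first, so for any method defined on
# both bases the async version wins; but a method defined *only* on
# SshTransport is inherited as a plain sync function, which is the kind of
# "weird behaviour" risk mentioned above.
mro = [cls.__name__ for cls in AsyncSshTransport.__mro__]
print(mro)
```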

Talked with @khsrali. Most functions here are only temporarily copied; they will change in the future to exploit the fact that they are executed asynchronously, to become more performant. Also, the async transport is currently a limited/simplified version of the ssh interface.

- __all__ = ('Transport',)
+ __all__ = ('AsyncTransport', 'Transport', 'TransportPath')

+ TransportPath = Union[str, Path, PurePosixPath]
A separate PR would be nicer, but if not, at least mention this in the commit message.

src/aiida/transports/transport.py (resolved review thread)
def path_to_str(path: TransportPath) -> str:
"""Convert an instance of TransportPath = Union[str, Path, PurePosixPath] instance to a string."""
# We could check if the path is a Path or PurePosixPath instance, but it's too much overhead.
return str(path)
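For context, str() alone already normalizes every member of the union, which is what the helper boils down to (the paths below are illustrative):

```python
from pathlib import Path, PurePosixPath
from typing import Union

TransportPath = Union[str, Path, PurePosixPath]

def path_to_str(path: TransportPath) -> str:
    """Convert a TransportPath instance to a string."""
    return str(path)

# str() gives the same result for all three members of the union
# (on a POSIX system):
print(path_to_str('/tmp/calc'))
print(path_to_str(Path('/tmp/calc')))
print(path_to_str(PurePosixPath('/tmp/calc')))
```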
Hm... implemented like this, I don't see the difference from just using str(path). Is this somehow related to the type checker? But even then I don't understand. Did you want to do an instance check here and forget it?

Talked with @khsrali. We agreed on removing it.

Done

@unkcpz (Member) commented Jan 18, 2025:

Why is changing the permissions of this devops file needed? If there is a reason behind it, it would be better in a separate PR.


This was a mistake!
Thanks for noticing, I'll take it off in the next commit.


khsrali commented Jan 20, 2025

@agoscinski I applied your review. Also opened an issue for all TODOs as you requested: #6719

@khsrali khsrali added the v2.7.0 label Jan 20, 2025
4 participants