Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset network_cli command_timeout (alarm) on data recv'd #35817

Closed
wants to merge 7 commits into from

Conversation

bdowling
Copy link
Contributor

@bdowling bdowling commented Feb 7, 2018

SUMMARY

network_cli currently depends on the command_timeout variable to set the length of time that it waits for a command to FULLY complete. But it does not take into account if the device is actually sending output still.

Why not instead reset this timer when data is received?

This PR proposes that the alarm() timer should simply be reset whenever data is received. I did this in a simple way by seeing if the handler for SIGALRM was set.

FYI, with this change, I was able to set command_timeout=1 in the config and still run a number of long-winded test plays successfully including ones that previously failed with timeout set to 10 (due to the show running all that took 23 seconds on my DUT).

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME

network_cli

ANSIBLE VERSION
devel
ADDITIONAL INFORMATION

Extending COMMAND_TIMEOUT is a oft recommended fix to these timeout issues, but setting it to some large value is impractical because it results in any non-responsive device taking that much extra time to abort the command.

@bdowling bdowling force-pushed the fix_network_timeout branch from 7a1234a to bd4447e Compare February 7, 2018 03:25
@ansibot
Copy link
Contributor

ansibot commented Feb 7, 2018

@ansibot ansibot added bugfix_pull_request needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. needs_triage Needs a first human triage before being processed. networking Network category support:core This issue/PR relates to code supported by the Ansible Engineering Team. labels Feb 7, 2018
@bdowling bdowling force-pushed the fix_network_timeout branch from bd4447e to 2c81522 Compare February 7, 2018 03:36
@ansibot ansibot removed the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Feb 7, 2018
@samdoran samdoran removed the needs_triage Needs a first human triage before being processed. label Feb 8, 2018
@gundalow
Copy link
Contributor

@bdowling Is this causing you issues at the moment, ie do you have playbooks that fail at the moment with the default timeouts?

@bdowling
Copy link
Contributor Author

Yes, a very simple play using ios_config triggers this in part because it calls show running-config all. Now I may have a slightly slow test device (csr1000v) on my hands, but this should still be considered a bug. As noted, when I switch to this PR branch it completes just fine, resetting the alarm timer when data is still being received is the "right thing to do" ™️.

---

- hosts: onerouter
  gather_facts: no
  connection: network_cli

  tasks:
  - name: acl
    ios_config:
      lines:
        - access-list 99 permit 192.168.1.100

Result:

TASK [acl] **************************************************************************************************************************************************
task path: /home/bdowling/src/netdev/plays/bug-acl.yaml:8
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.module_utils.connection.ConnectionError: timeout trying to send command: show running-config all
fatal: [bd-csr1000v-00]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_LMpMhE/ansible_module_ios_config.py\", line 551, in <module>\n    main()\n  File \"/tmp/ansible_LMpMhE/ansible_module_ios_config.py\", line 454, in main\n    config, have_banners = get_running_config(module, config)\n  File \"/tmp/ansible_LMpMhE/ansible_module_ios_config.py\", line 352, in get_running_config\n    contents = get_config(module, flags=flags)\n  File \"/tmp/ansible_LMpMhE/ansible_modlib.zip/ansible/module_utils/network/ios/ios.py\", line 117, in get_config\n  File \"/tmp/ansible_LMpMhE/ansible_modlib.zip/ansible/module_utils/connection.py\", line 146, in __rpc__\nansible.module_utils.connection.ConnectionError: timeout trying to send command: show running-config all\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 1}
        to retry, use: --limit @/home/bdowling/.ansible/retry/bug-acl.retry

@bdowling
Copy link
Contributor Author

With some additional (local) debugging turned on, you can see that the show running all is taking a good amount of time and ansible cuts the connection 10 seconds in a partial receipt of the response:

2018-02-14 16:15:44,253 p=1766 u=bdowling |    1773 1518624944.25329: CLI-PROMPT: looking for prompts in [21]| show running-config all\r\nBuilding configuration...\r\n
2018-02-14 16:15:45,019 p=1766 u=bdowling |    1773 1518624945.01909: CLI-PROMPT: looking for prompts in [22]| \r\nCurrent configuration with default configu
rations exposed : 223507 bytes\r\n!\r\n! Last configuration change at 15:46:59 UTC Wed Feb 14 2018 by neoadmin\r\n!\r\nversion 15.5\r\ndownward-compatible-co
nfig 15.5\r\nno service log backtrace\r\nservice config\r\nno service e

...
2018-02-14 16:15:53,767 p=1766 u=bdowling |    1773 1518624953.76767: CLI-PROMPT: looking for prompts in [437]| slot mma bay notice process 1 module R1 level forwarding-manager\r\nplatform trace runtime slot mmon bay notice process 0 module F0 level forwarding-manager\r\nplatform trace runtime slot mmon bay notice process 0 module F1 level forwarding-manager\r\nplatform t
2018-02-14 16:15:54,062 p=1766 u=bdowling |  command timeout triggered, timeout value is 10 secs
2018-02-14 16:15:54,062 p=1766 u=bdowling |    1773 1518624954.06231: closing ssh connection to device
2018-02-14 16:15:54,062 p=1766 u=bdowling |    1773 1518624954.06262: firing event: on_close_shell()
2018-02-14 16:15:54,062 paramiko.transport [chan 0] EOF sent (0)
2018-02-14 16:15:54,063 p=1766 u=bdowling |    1773 1518624954.06330: cli session is now closed
2018-02-14 16:15:54,063 p=1766 u=bdowling |    1773 1518624954.06355: ssh connection has been closed successfully
2018-02-14 16:15:54,063 p=1766 u=bdowling |  shutdown complete
2018-02-14 16:15:54,064 p=1766 u=bdowling |  shutdown complete
2018-02-14 16:15:54,086 p=1756 u=bdowling |    1765 1518624954.08666: done communicating
2018-02-14 16:15:54,087 p=1756 u=bdowling |    1765 1518624954.08696: done with local.exec_command()
2018-02-14 16:15:54,087 p=1756 u=bdowling |    1765 1518624954.08719: _low_level_execute_command() done: rc=1, stdout=, stderr=Traceback (most recent call last):
  File "/tmp/ansible_qNPUQd/ansible_module_ios_config.py", line 551, in <module>
    main()
  File "/tmp/ansible_qNPUQd/ansible_module_ios_config.py", line 454, in main
    config, have_banners = get_running_config(module, config)
  File "/tmp/ansible_qNPUQd/ansible_module_ios_config.py", line 352, in get_running_config
    contents = get_config(module, flags=flags)
  File "/tmp/ansible_qNPUQd/ansible_modlib.zip/ansible/module_utils/network/ios/ios.py", line 117, in get_config
  File "/tmp/ansible_qNPUQd/ansible_modlib.zip/ansible/module_utils/connection.py", line 146, in __rpc__
ansible.module_utils.connection.ConnectionError: timeout trying to send command: show running-config all

@ansibot ansibot added stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. and removed stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Feb 22, 2018
@bdowling bdowling changed the title Fix network_cli command_timeout by extending on data recv'd Reset network_cli command_timeout (alarm) on data recv'd Feb 24, 2018
@privateip
Copy link
Contributor

@bdowling your proposed change will break if the prompt is not properly identified and the session will be left in a hung state. It is precisely that situation that the current code path protects.

@bdowling
Copy link
Contributor Author

bdowling commented Feb 26, 2018

@privateip previously this may have been the case. But now with the great work from @ganeshrn and in particular the merged fix in #35439 network_cli no longer sends any commands to try to wake up the prompt (it just uses the last seen).

So in this PR, even in cases where the prompt is not found, the timeout will still occur because the device stops sending any output. I hope that you will see from the tests below that I have addressed your concerns. Network delays, device performance, etc can all affect how fast the device can return all the data we are asking of it. Using a timeout that is reset whenever the device outputs data is better than abruptly disconnecting a working, but slow connection short.

To prove out this theory, I have tested this in my branch in two ways.

  1. I broke the enable password: prompt by fudging the regex. Which results in a timeout:
AnsibleConnectionFailure: unable to elevate privilege to enable mode, at prompt [None] with error: timeout trying to send command: enable
  1. I patched network_cli to mess with the stdout_re after I send a special command:
diff --git a/lib/ansible/plugins/connection/network_cli.py b/lib/ansible/plugins/connection/network_cli.py
index 1e14949..03adbca 100644
--- a/lib/ansible/plugins/connection/network_cli.py
+++ b/lib/ansible/plugins/connection/network_cli.py
@@ -388,6 +388,11 @@ class Connection(ConnectionBase):
         if signal_handler == signal.SIG_DFL:
             signal_handler = None

+
+        if command and 'break prompts' in command:
+            display.debug("CLI-RECEIVE Breaking prompt matching")
+            self._terminal.terminal_stdout_re = [re.compile(br'NeverMatchThis')]
+
         while True:
             data = self._ssh_shell.recv(256)

With play:

---

- hosts: all
  gather_facts: no
  connection: network_cli

  tasks:
  - name: echo break prompts
    ios_command:
      authorize: yes
      commands: echo break prompts

Which also results in the expected timeout:

ansible.module_utils.connection.ConnectionError: timeout trying to send command: echo break prompts

@privateip
Copy link
Contributor

@bdowling just a quick update. I talked with @ganeshrn and he is going to do some testing of this patch and we will try to pull it in. no promises but we will try our best.

@privateip privateip self-requested a review March 1, 2018 02:46
@privateip privateip added this to the 2.5.0 milestone Mar 1, 2018
@privateip privateip requested a review from ganeshrn March 1, 2018 02:47
@ansibot
Copy link
Contributor

ansibot commented Mar 1, 2018

cc @gdpak
click here for bot help

@ansibot ansibot added bug This issue/PR relates to a bug. and removed bugfix_pull_request labels Mar 2, 2018
@ganeshrn
Copy link
Member

ganeshrn commented Mar 8, 2018

I have done some manual testing and added integration and unit test as part of PR #37185 to test this patch. Seems to be working fine.

@bdowling bdowling force-pushed the fix_network_timeout branch from d675918 to b5ddd7e Compare October 3, 2018 15:43
@ansibot
Copy link
Contributor

ansibot commented Oct 3, 2018

@ansibot ansibot removed merge_commit This PR contains at least one merge commit. Please resolve! needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Oct 3, 2018
@bdowling bdowling force-pushed the fix_network_timeout branch from e755185 to 3f15793 Compare October 3, 2018 16:50
@bdowling bdowling force-pushed the fix_network_timeout branch from 3f15793 to b114727 Compare October 3, 2018 17:00
@ansibot ansibot removed the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Oct 6, 2018
@ansibot ansibot added stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. stale_review Updates were made after the last review and the last review is more than 7 days old. labels Oct 14, 2018
@ansibot ansibot added needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. and removed stale_review Updates were made after the last review and the last review is more than 7 days old. labels Oct 24, 2018
@ansibot ansibot added the docsite This issue/PR relates to the documentation website. label Nov 5, 2018
@ansibot ansibot added the support:network This issue/PR relates to code supported by the Ansible Network Team. label Nov 26, 2018
@samccann
Copy link
Contributor

@ganeshrn - is there more work to go on this PR? (as in does it still need more revisions or perhaps just a rebase?)

@gundalow gundalow removed this from the 2.7.0 milestone Mar 5, 2019
@samccann
Copy link
Contributor

samccann commented Jun 5, 2019

@ganeshrn @ikhan2010 - I don't see evidence of a response from @bdowling in some time. Can we either close this PR or have someone internal take it over, rebase, and merge? Thanks.

@ansibot
Copy link
Contributor

ansibot commented Jun 29, 2019

@ansibot ansibot added collection Related to Ansible Collections work collection:ansible.netcommon labels Apr 29, 2020
@samccann
Copy link
Contributor

Hi and thank you for your contribution to Ansible. With the release of Ansible 2.10, this code has been updated significantly and moved to https://github.com/ansible-collections/ansible.netcommon

If you feel this is still an applicable change, please open a PR in that repository.

@samccann samccann closed this Sep 25, 2020
@ansible ansible locked and limited conversation to collaborators Oct 23, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
affects_2.6 This issue/PR affects Ansible v2.6 bug This issue/PR relates to a bug. collection:ansible.netcommon collection Related to Ansible Collections work docs This issue/PR relates to or includes documentation. docsite This issue/PR relates to the documentation website. has_issue needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. networking Network category stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. support:core This issue/PR relates to code supported by the Ansible Engineering Team. support:network This issue/PR relates to code supported by the Ansible Network Team. test This PR relates to tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants