Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSL EOF leads to persistent non-delivery of messages #750

Closed
gdt opened this issue Oct 1, 2023 · 5 comments · Fixed by #797
Closed

SSL EOF leads to persistent non-delivery of messages #750

gdt opened this issue Oct 1, 2023 · 5 comments · Fixed by #797
Labels
Status: Available No one has claimed responsibility for resolving this issue. Type: Bug

Comments

@gdt
Copy link

gdt commented Oct 1, 2023

I am using 1.6.1, NetBSD 9 amd64, python 3.10. The broker is up-to-date mosquitto, on a remote NetBSD 9 amd64 system. I am using 8883 with a real certificate, and generally everything works well.

I have a script which polls a UPS and publishes messages usually once a minute, but at intervals of up to 1s if something interesting has happened. On that system, I took the WAN interface down, releasing the lease, removing addresses and default route, and ifconfig down, waited an hour and brought it back. Obviously I won't have data in the meantime but I expect it to recover.

In the log (stdout/stderr) of the program I see (with the json shortened; there are actually 14 keys):

SUBMITTED time {'time': 1696117812.470824, 'runtime': 5940.0}
PUBLISHED 16075
SUBMITTED time {'time': 1696117873.4957478, 'runtime': 5940.0}
Exception in thread Thread-1 (_thread_main):
Traceback (most recent call last):
  File "/usr/pkg/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/pkg/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 3591, in _thread_main
    self.loop_forever(retry_first_connection=True)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1756, in loop_forever
    rc = self._loop(timeout)
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1181, in _loop
    rc = self.loop_write()
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 1577, in loop_write
    rc = self._packet_write()
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 2464, in _packet_write
    write_length = self._sock_send(
  File "/usr/pkg/lib/python3.10/site-packages/paho/mqtt/client.py", line 649, in _sock_send
    return self._sock.send(buf)
  File "/usr/pkg/lib/python3.10/ssl.py", line 1206, in send
    return self._sslobj.write(data)
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2426)
PUBLISHED 16076
SUBMITTED time {'time': 1696117934.524036, 'runtime': 5940.0}
SUBMITTED time {'time': 1696117994.8527699, 'runtime': 5940.0}

So the "EOF" looks like write failed. Probably this is "no route to host". However I don't get a disconnect callback. And I did get a publish callback! But surely the broker didn't get the message.
Then, future publish calls happen, with no exceptions. But no publish callbacks.

Obviously I can write a watchdog to close/open if I don't get pubacks to python.

But it seems that the ssl error should be caught and the connection should be judged non-functional. Or at least some later publish should cause failure and to close the connection. The user program should not get an exception thrown other than in response to an API call and then only documented exceptions.

I don't think this is related to threading/locking, as the program only did one publish call, and without the network being down, runs for months.

@github-actions github-actions bot added the Status: Available No one has claimed responsibility for resolving this issue. label Oct 1, 2023
@gdt
Copy link
Author

gdt commented Oct 4, 2023

(I did write a watchdog, but calling close and then reopen did not result in messages flowing.)

Reading client.py, I see _packet_write calling _sock_send on line 2506. A number of exceptions are caught, but there is no general catc, and obviously the one above is not caught. It seems obvious that an exception that has not been considered carefully risks leaving the socket in a broken state, as happens to me. And therefore that other exceptions should lead to the close/retry step.

@warped-rudi
Copy link

Are there any updates on this issue? I'm seeing it as well when the internet connection gets lost temporarily.

@PierreF
Copy link
Contributor

PierreF commented Jan 7, 2024

It seems you are using loop_start, and some exception aren't correctly handled which cause the loop thread to crash. Once crashed no more processing it done.

I've made a PR (#797) which should fix your problem.

@gdt
Copy link
Author

gdt commented Jan 8, 2024

Thanks for addressing this. I am in fact using loop_start. Modulo variables to set my username/etc. and to enable TLS, the code is at https://github.com/merbanan/rtl_433/blob/master/examples/rtl_433_mqtt_relay.py . But, aside from this failure, everything else is fine. I am grabbing the patch in the PR, will patch my system, and restart it.

@warped-rudi
Copy link

I gave this patch a try and first tests were positive. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Available No one has claimed responsibility for resolving this issue. Type: Bug
Projects
None yet
4 participants