[Bug]: WiFi / Network connectivity issues with 2.4.13+ #5458

koliha · 2024-11-26T17:50:23Z

Hardware

T-Beam, Heltec V3, Station G2

Firmware Version

2.5.13

Description

After a period of time (6-18hrs, no longer than 24hr), WiFi-enabled ESP32-based devices are losing network connectivity (not physical connectivity, just no longer passing traffic). I have tested this with 2.5.13 and 2.5.14, but not older 2.5.x builds. I'm going to set up a Heltec V3 for testing so I can capture console traffic. I'll post that here when I'm able to do so.

When the issue occurs:

You can de-auth the node or power off the access point that it's connected to and it reconnects to wifi (same or another AP)
When it reconnects you can see the node request DHCP and get a response (dhcp server logs)
The DHCP exchange is successfully sniffed by my Ubiquiti setup
Traffic logs (ubiquiti) show WiFi rx but no tx for the node during this issue

I am seeing this on:

Two different wireless networks using Ubiquiti hardware
One wireless network running off a Netgear Nighthawk
Station G2, Heltec V3, and a T-Beam

Relevant log output

No response

koliha · 2024-11-26T22:00:21Z

This is what I see from the WiFi side -- this pattern repeats.

It's not related to signal, here is a closer AP with the same pattern/behavior:

If I check my DHCP logs for today I see DHCP lease renewals up until the wifi drop at 2:57pm. The 3:16pm wifi drop is seen as well, but nothing for the 3:24/26/27 reconnects.

Not sure if I can set a static IP, but that's my next step in attempting to troubleshoot.

garthvh · 2024-11-27T02:17:14Z

Are you connected to the public mqtt server? Do you have device logs?

CTassisF · 2024-11-27T02:26:44Z

Have you tried the latest "Bleeding" firmware? A bug affecting Wi-Fi connectivity (reported in #5387) was recently fixed in PR #5439.

koliha · 2024-11-27T02:41:38Z

> Are you connected to the public mqtt server? Do you have device logs?

No device logs. Using a public MQTT server (used by tens of others), but not "the" public MQTT server. MQTT server resides on my local network (connecting via private address).

> Have you tried the latest "Bleeding" firmware? A bug affecting Wi-Fi connectivity (reported in #5387) was recently fixed in PR #5439.

This seems extremely relevant. Only using pre-built/published builds, so I'll probably wait for a release to test.

I have a static IP set as of 8pm on the problematic node and will report back the results. Based on #5387 I would assume that it's still going to occur with a static. Rather than try to set something up to collect logs, I'll probably wait out the next alpha to see if it solves.

CTassisF · 2024-11-27T02:48:42Z

There were recent changes to MQTT servers in "private" (RFC1918) networks (discussed in #5203) that might be related to your issue. However, if I were to make an educated guess, most (if not all) of these Wi-Fi issues were likely caused by the memory leaks fixed in MQTT::onReceive.
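For anyone unsure whether their broker address falls into the "private" (RFC1918) category mentioned above, here is a minimal Python sketch. It only checks an IPv4 address against the three RFC 1918 ranges; it says nothing about how the firmware itself makes that decision, and `is_rfc1918` is a hypothetical helper name, not something from the Meshtastic code base.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if `address` is an IPv4 address inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


if __name__ == "__main__":
    # 192.168.5.85 appears in the broker log later in this thread;
    # the other two addresses are arbitrary examples.
    for addr in ("192.168.5.85", "172.32.0.1", "8.8.8.8"):
        print(addr, is_rfc1918(addr))
```

Note that a node reaching the broker through a public name and a NAT rule, as described in the next comment, would be checked against the public address, not the broker's LAN address.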

koliha · 2024-11-27T02:52:12Z

> There were recent changes to MQTT servers in "private" (RFC1918) networks (discussed in #5203) that might be related to your issue. However, if I were to make an educated guess, most (if not all) of these Wi-Fi issues were likely caused by the memory leaks fixed in MQTT::onReceive.

I am assuming the same re: MQTT::onReceive. I'm using public dns and IP and relying on a nat mirroring rule to route it back inside, so the RFC1918 issue shouldn't impact, but still great info to know.

fifieldt · 2024-11-27T10:43:46Z

Potentially also relevant: we were power saving even when wifi was connected. That's fixed now: #5443

koliha · 2024-11-27T13:45:54Z

> Potentially also relevant: we were power saving even when wifi was connected. That's fixed now: #5443

Saw that but it's not related.

Static IP has kept it from falling off completely overnight, but it's disconnecting+reconnecting to wifi over and over and seems a little wonky overall (pretty sure it's not beaconing device metrics to MQTT consistently). 2 minutes of no ping responses, responds for 5-7 seconds, then back to no response.

I'll wait for a release that incorporates #5387 before I attempt to troubleshoot further.
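The "2 minutes of no ping responses, responds for 5-7 seconds" pattern is easier to compare between nodes when it is written down as runs of up/down time. A rough Python sketch, assuming you already collect `(timestamp, reachable)` samples from some once-per-second ping loop (the collection itself is not shown, and `summarize_runs` is a hypothetical helper name):

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def summarize_runs(
    samples: Iterable[Tuple[float, bool]]
) -> List[Tuple[bool, float, float]]:
    """Collapse (timestamp, reachable) samples into runs.

    Returns one (reachable, first_timestamp, last_timestamp) tuple per run
    of consecutive samples that share the same reachable state.
    """
    runs = []
    for state, group in groupby(samples, key=lambda s: s[1]):
        items = list(group)
        runs.append((state, items[0][0], items[-1][0]))
    return runs


if __name__ == "__main__":
    # Synthetic data shaped like the pattern described above:
    # 120 s without replies, 7 s of replies, then silence again.
    demo = (
        [(float(t), False) for t in range(0, 120)]
        + [(float(t), True) for t in range(120, 127)]
        + [(float(t), False) for t in range(127, 131)]
    )
    for reachable, start, end in summarize_runs(demo):
        print("up  " if reachable else "down", start, end)
```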

leshniak · 2024-11-27T23:05:04Z

I'm also experiencing the issue on 2.5.14.f2ee0df and latest beta. Waiting for the mentioned fixes.

LowVoltagePirate · 2024-11-29T07:56:50Z

I've built the files from the master repo and can confirm that the issue seems fixed; my T-Beam has now been online >48 hours without any issues.

leshniak · 2024-11-29T08:04:50Z

I’ve also decided to make my own build yesterday. So far so good, it maintains a stable WiFi connection and heap usage stabilized at 90% after 24h.

leshniak · 2024-11-29T16:07:44Z

Hmm, the same issue sometimes occurs while connecting to the node via web interface and TCP. The device rebooted itself after a few minutes. No logs for now unfortunately, due to remote location.

I have ~70 nodes in my NodeDB, if that matters.

Edit: looks like all is fine if I connect to the node shortly after the reboot.

CTassisF · 2024-11-29T18:38:05Z

Meshtastic Firmware 2.5.15.79da236 Alpha was released with fixes for #5387.

fifieldt · 2024-11-29T23:07:58Z

Fixed by #5387

Matzebhv · 2024-12-01T15:51:40Z

Had to reopen, Firmware 2.5.15.79da236 does not fix this issue. My 2 Supremes are disconnecting after a couple of hours.
Same behavior as described here #5458 (comment)

commanderts · 2024-12-02T00:57:35Z

Same problem here with a Heltec V3...

koliha · 2024-12-02T14:22:10Z

Also seeing the same behavior with my Station G2s on 2.5.15. I will try to pull logs later today.

Xaositek · 2024-12-02T21:35:56Z

I notice you're on a UniFi Network, here's my settings for you to compare against. I'm running UniFi Network 9.0.92, Firmware 6.7.9 and physical devices are two U6 Pros and a U6 In-Wall - I have not yet been able to reproduce any issues. I have a Heltec V3 and a T-Beam (Original) connected via WiFi, they run for weeks and zero issues; my Heltec V3 is on 2.5.15, flashed 3 days ago and no problems.

koliha · 2024-12-02T22:06:00Z

> I notice you're on a UniFi Network, here's my settings for you to compare against. I'm running UniFi Network 9.0.92, Firmware 6.7.9 and physical devices are two U6 Pros and a U6 In-Wall - I have not yet been able to reproduce any issues. I have a Heltec V3 and a T-Beam (Original) connected via WiFi, they run for weeks and zero issues; my Heltec V3 is on 2.5.15, flashed 3 days ago and no problems.

Thanks for the reply. I'm seeing this on two fairly complex networks w/Ubiquiti as well as a very simple cable modem + netgear nighthawk consumer router setup. Reportedly the issue does not occur if MQTT is disabled - do you have your node connected via mqtt?

I'm currently timing the issue, re-checking the behavior post-patch, and capturing logs. After that I'll try turning MQTT off and see if it makes any difference. It would be nice to narrow it down. I highly doubt it's actually something to do with the code around networking.

Xaositek · 2024-12-02T22:13:10Z

Yes, It is connected via MQTT to my local MQTT server, not public MQTT.

It is using a DNS name which will internally resolve to a local IP.

Matzebhv · 2024-12-02T22:31:19Z

It has nothing to do with UniFi or anything like that. No special setup here: one node is connected to my Fritzbox and the other is connected at my workplace to an Extreme enterprise AP. LongFast is uplink only, and 3 channels with moderate traffic are set up with uplink/downlink. If I disable MQTT the node stays online.

garthvh · 2024-12-02T23:44:59Z

> It has nothing to do with UniFi or anything like that. No special setup here: one node is connected to my Fritzbox and the other is connected at my workplace to an Extreme enterprise AP. LongFast is uplink only, and 3 channels with moderate traffic are set up with uplink/downlink. If I disable MQTT the node stays online.

If your mqtt is on the default topic that is likely too much traffic for the node to handle.

leshniak · 2024-12-03T10:07:31Z

> If your mqtt is on the default topic that is likely too much traffic for the node to handle.

Mine is connecting to a local mosquitto instance (not bridged) and behaves the same way.

koliha · 2024-12-03T14:18:49Z

> > If your mqtt is on the default topic that is likely too much traffic for the node to handle.

> Mine is connecting to a local mosquitto instance (not bridged) and behaves the same way.

Similar. One node is connecting to self-hosted mosquitto on LAN, the other is connecting to the same through the internet (NAT). Decently low traffic overall (way less than the official/dev server)

leshniak · 2024-12-03T14:38:01Z

The attached log from mosquitto shows how unstable the connection is. Then I've triggered a reboot via LoRa and it's all gone.

I think this looks a bit weird:

1733217530: New connection from 192.168.5.85:55328 on port 1883.
1733217530: New client connected from 192.168.5.85:55328 as !2f93dc9c (p2, c1, k15).
1733217554: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217561: New connection from 192.168.5.85:55329 on port 1883.
1733217561: New client connected from 192.168.5.85:55329 as !2f93dc9c (p2, c1, k15).
1733217584: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217589: New connection from 192.168.5.85:50628 on port 1883.
1733217589: New client connected from 192.168.5.85:50628 as !2f93dc9c (p2, c1, k15).
1733217678: New connection from 192.168.5.85:51534 on port 1883.
1733217678: Client !2f93dc9c already connected, closing old connection.
1733217678: New client connected from 192.168.5.85:51534 as !2f93dc9c (p2, c1, k15).

I have a Wemos D1 Mini placed in the exact same location, with ESPHome firmware and it's super stable despite having a ~10 dBm weaker WiFi signal.

log.txt
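To put numbers on the excerpt above: the `k15` in `(p2, c1, k15)` is the keep-alive the client requested, in seconds, and per the MQTT specification a broker disconnects a client when it receives no control packet within 1.5 times that period (22.5 s here). The small Python sketch below pairs each "New client connected" line with the following "exceeded timeout" line for the same client and prints how long the session lasted. The regular expressions are written against the lines quoted in this comment only and are an assumption for other mosquitto versions or log settings.

```python
import re

CONNECT_RE = re.compile(
    r"^(\d+): New client connected from \S+ as (\S+) \(p\d+, c\d+, k(\d+)\)\.$"
)
TIMEOUT_RE = re.compile(
    r"^(\d+): Client (\S+) has exceeded timeout, disconnecting\.$"
)

# Lines copied from the broker log excerpt in this comment.
SAMPLE_LOG = """\
1733217530: New connection from 192.168.5.85:55328 on port 1883.
1733217530: New client connected from 192.168.5.85:55328 as !2f93dc9c (p2, c1, k15).
1733217554: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217561: New connection from 192.168.5.85:55329 on port 1883.
1733217561: New client connected from 192.168.5.85:55329 as !2f93dc9c (p2, c1, k15).
1733217584: Client !2f93dc9c has exceeded timeout, disconnecting.
"""


def timeout_sessions(lines):
    """Yield (client_id, keepalive_s, session_s) for sessions ended by a timeout."""
    connected = {}  # client id -> (connect timestamp, keep-alive seconds)
    for line in lines:
        line = line.strip()
        m = CONNECT_RE.match(line)
        if m:
            connected[m.group(2)] = (int(m.group(1)), int(m.group(3)))
            continue
        m = TIMEOUT_RE.match(line)
        if m and m.group(2) in connected:
            start, keepalive = connected.pop(m.group(2))
            yield m.group(2), keepalive, int(m.group(1)) - start


if __name__ == "__main__":
    for client, keepalive, duration in timeout_sessions(SAMPLE_LOG.splitlines()):
        print(f"{client}: keep-alive {keepalive}s, dropped after {duration}s")
```

Both sessions in the excerpt end 23-24 s after connecting, which is about what you would expect if the broker received nothing from the node after the initial handshake. That is only consistent with the node going silent right after connecting; it does not show why.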

garthvh · 2024-12-03T15:29:24Z

Is the mqtt JSON functionality being used? Who is hosting the private brokers?

koliha · 2024-12-03T15:30:35Z

> Is the mqtt JSON functionality being used? Who is hosting the private brokers?

Not in any of my use cases. I'm hosting the mqtt server locally (ubuntu vm, mosquitto). Seeing it w/node on LAN as well as one that is connecting through NAT/internet.

leshniak · 2024-12-04T17:40:41Z

Currently I'm testing the connection through WiFi repeater and with different network settings on both ends - maybe it doesn't like my access point and/or vice versa. I'll let you know about the results in next couple of days.

(testing also #5490 case)

koliha · 2024-12-05T21:53:45Z

I just wanted to provide an update as it's been a couple of days. During my initial tests, both nodes died at 36 hours. Currently, both have been up for 65 hours. The only thing that is different is that I'm using PRTG to ping and scrape json data (http) every 60 seconds.

I'm going to turn off the ping/json traffic, reboot both nodes, and see if I can reproduce the original issue. I'll circle back here in ~48hr.
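If someone without PRTG wants to generate the same kind of periodic HTTP traffic to see whether it changes the behavior, a rough Python stand-in could look like the sketch below. The URL in the usage comment is a placeholder: both the address and the `/json/report` path are assumptions, so point it at whatever JSON endpoint your node actually serves. `fetch_json` and `poll` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch `url` over HTTP and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll(url, interval_s=60.0, iterations=None, fetch=fetch_json, sleep=time.sleep):
    """Request `url` every `interval_s` seconds and record success or failure.

    Runs forever when `iterations` is None. Otherwise it stops after that
    many requests and returns a list of (ok, detail) tuples.
    """
    results = []
    count = 0
    while iterations is None or count < iterations:
        try:
            fetch(url)
            results.append((True, "ok"))
        except Exception as exc:  # timeouts, connection errors, bad JSON
            results.append((False, type(exc).__name__))
        print(time.strftime("%H:%M:%S"), results[-1])
        count += 1
        if iterations is None or count < iterations:
            sleep(interval_s)
    return results


# Example usage (placeholder address and path, adjust to your node):
# poll("http://192.168.1.50/json/report", interval_s=60)
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a real node or real waiting.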

EnGamma · 2024-12-07T18:49:55Z

I'm seeing this since updating fw from 2.3.10 to 2.5.11 on my 4 TLORA_V2_1_1P6 devices. Can't reliably connect from my phone via WiFi to the fixed IPs of the nodes. Sometimes switching between nodes and back connects, but often not. Sometimes switching between WiFi SSIDs (all same network) fixes it, but often not.

koliha · 2024-12-09T22:20:38Z

Thank you all for your patience while I attempted to reproduce this. The original test nodes have been up for 4 days solid and the remote node (on a completely different "consumer" network setup) has been up and going for a little over 2 days now.

Subjectively, I feel that this issue can be closed out as resolved. I was able to reproduce this easily on 2.5.13/2.5.14. With 2.5.15 I had an MQTT/network drop-off initially (with two nodes simultaneously), but I haven't been able to reproduce the issue since. T1000-E stopped randomly rebooting itself with the fixes in 2.5.15 also.

It would be good to note on the github releases page that this mqtt/network issue exists in 2.3.10 - 2.3.14 (apparently).

Matzebhv · 2024-12-09T22:27:20Z

> Subjectively, I feel that this issue can be closed out as resolved.

OK? What is the solution here?

koliha · 2024-12-09T22:30:27Z

> > Subjectively, I feel that this issue can be closed out as resolved.

> OK? What is the solution here?

Upgrade to 2.5.15.

Matzebhv · 2024-12-09T23:03:16Z

> > > Subjectively, I feel that this issue can be closed out as resolved.

> > OK? What is the solution here?

> Upgrade to 2.5.15.

Really?? We are on .15. This problem is not solved. Please reopen, this really sucks.

leshniak · 2024-12-12T15:45:34Z

The issue still exists. I've managed to make it more stable by adding an esp_wifi_repeater, but it's a workaround. Now my theory is that something is clogging (buffer? memory leak?) when TCP segments have to be retransmitted, because the same thing occurs while using the Web client.

Matzebhv · 2024-12-12T19:40:14Z

Same as here -> #5549

Matzebhv · 2024-12-12T19:45:48Z

What is really bad, you have a couple of people in this issue with this problem. One person says "it is fixed for me" and zap -> closed.

thebentern · 2024-12-12T19:55:37Z

> What is really bad, you have a couple of people in this issue with this problem. One person says "it is fixed for me" and zap -> closed.

The original reporter of this issue said that their issue was fixed for them by upgrading. You have inserted yourself into this issue without any background context to whether or not you are even experiencing the same scenario, on the same hardware, or any reproduction steps. I think you should rework your approach here to be less combative and demanding.

koliha · 2024-12-13T18:38:30Z

Just as a final update, both of the original nodes that had issues have been up for almost 8 days solid now. I've started upgrading remote nodes (~100mi away) with 2.5.15/.16 and have not had any issues with them either. Mixture of ESP32 and NRF hardware. I'm also tracking heap free on one of the two original problematic nodes and haven't seen anything that screams memory leak either.

Thanks for the assistance/patience on this one.
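For anyone else tracking heap free to look for a leak, one simple way to read the samples is to fit a straight line through them and look at the slope. The Python sketch below assumes you already have `(seconds, free_heap_bytes)` pairs from wherever you log them; `heap_trend` is a hypothetical helper name, and the numbers in the demo are made up, not measurements from any node in this thread.

```python
from typing import Sequence, Tuple


def heap_trend(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of free heap (bytes) over time (seconds).

    A clearly negative slope over many hours suggests free heap is
    shrinking; a slope near zero suggests it is stable.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    num = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must not all be equal")
    return num / den


if __name__ == "__main__":
    # Made-up hourly samples losing 1000 bytes of free heap per hour.
    demo = [(0.0, 100000.0), (3600.0, 99000.0), (7200.0, 98000.0)]
    print(f"{heap_trend(demo) * 3600:.0f} bytes/hour")
```

A short window can hide a slow leak, so a flat line over a day or two is weaker evidence than a flat line over a week.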

koliha · 2024-12-16T16:22:55Z

Spoke too soon, not solved. Looks like a memory leak ( see #5549 ). #5559 was opened for the same reason it seems. Posted additional details/comments there since it's open and this one has been closed.

xhci · 2024-12-18T17:04:44Z

I also see this on a Station G2. I have the power supply set to cycle every x hours as a workaround.

koliha added the bug label Nov 26, 2024

fifieldt closed this as completed Nov 29, 2024

leshniak mentioned this issue Dec 2, 2024

[Bug]: Watchdog triggered while using HTTP API #5490

Open

thebentern reopened this Dec 2, 2024

liquidraver mentioned this issue Dec 4, 2024

[Bug]: Node rebooting from receiving invalid packet #5508

Closed

thebentern closed this as completed Dec 9, 2024

Matzebhv mentioned this issue Dec 12, 2024

[Bug]: Memory leaks in MQTT::onReceive #5549

Closed

Matzebhv mentioned this issue Dec 13, 2024

[Bug]: WiFi permanently disconnects after some time when using MQTT, reboot required to fix #5559

Closed

thebentern reopened this Dec 16, 2024

esev mentioned this issue Dec 17, 2024

Refactor MQTT::onReceive to reduce if/else nesting #5592

Merged

thebentern closed this as completed in #5592 Dec 19, 2024
