[Bug]: WiFi / Network connectivity issues with 2.5.13+ #5458

Closed
koliha opened this issue Nov 26, 2024 · 41 comments · Fixed by #5592
Labels
bug Something isn't working

Comments

@koliha

koliha commented Nov 26, 2024

Category

WiFi

Hardware

T-Beam, Heltec V3, Station G2

Firmware Version

2.5.13

Description

After a period of time (6-18hrs, no longer than 24hr), WiFi-enabled ESP32-based devices are losing network connectivity (not physical connectivity, just no longer passing traffic). I have tested this with 2.5.13 and 2.5.14, but not older 2.5.x builds. I'm going to set up a Heltec V3 for testing so I can capture console traffic. I'll post that here when I'm able to do so.

When the issue occurs:

  • You can de-auth the node or power off the access point that it's connected to and it reconnects to wifi (same or another AP)
  • When it reconnects you can see the node request DHCP and get a response (dhcp server logs)
  • The DHCP exchange is successfully sniffed by my Ubiquiti setup
  • Traffic logs (ubiquiti) show WiFi rx but no tx for the node during this issue

I am seeing this on:

  • Two different wireless networks using Ubiquiti hardware
  • One wireless network running off a Netgear Nighthawk
  • Station G2, Heltec V3, and a T-Beam

Relevant log output

No response

@koliha added the bug label on Nov 26, 2024
@koliha
Author

koliha commented Nov 26, 2024

This is what I see from the WiFi side -- this pattern repeats.
Screenshot 2024-11-26 at 4 48 49 PM

It's not related to signal, here is a closer AP with the same pattern/behavior:
Screenshot 2024-11-26 at 4 42 43 PM

If I check my DHCP logs for today I see DHCP lease renewals up until the WiFi drop at 2:57pm. The 3:16pm WiFi drop is seen as well, but nothing for the 3:24/26/27 reconnects.
Screenshot 2024-11-26 at 4 50 49 PM

Not sure if I can set a static IP, but that's my next step in attempting to troubleshoot.

@garthvh
Member

garthvh commented Nov 27, 2024

Are you connected to the public mqtt server? Do you have device logs?

@CTassisF
Contributor

Have you tried the latest "Bleeding" firmware? A bug affecting Wi-Fi connectivity (reported in #5387) was recently fixed in PR #5439.

@koliha
Author

koliha commented Nov 27, 2024

Are you connected to the public mqtt server? Do you have device logs?

No device logs. Using a public MQTT server (used by tens of others), but not "the" public MQTT server. MQTT server resides on my local network (connecting via private address).

Have you tried the latest "Bleeding" firmware? A bug affecting Wi-Fi connectivity (reported in #5387) was recently fixed in PR #5439.

This seems extremely relevant. Only using pre-built/published builds, so I'll probably wait for a release to test.

I have a static IP set as of 8pm on the problematic node and will report back the results. Based on #5387 I would assume that it's still going to occur with a static. Rather than try to set something up to collect logs, I'll probably wait out the next alpha to see if it solves.

@CTassisF
Contributor

There were recent changes to MQTT servers in "private" (RFC1918) networks (discussed in #5203) that might be related to your issue. However, if I were to make an educated guess, most (if not all) of these Wi-Fi issues were likely caused by the memory leaks fixed in MQTT::onReceive.

@koliha
Author

koliha commented Nov 27, 2024

There were recent changes to MQTT servers in "private" (RFC1918) networks (discussed in #5203) that might be related to your issue. However, if I were to make an educated guess, most (if not all) of these Wi-Fi issues were likely caused by the memory leaks fixed in MQTT::onReceive.

I am assuming the same re: MQTT::onReceive. I'm using public dns and IP and relying on a nat mirroring rule to route it back inside, so the RFC1918 issue shouldn't impact, but still great info to know.

@fifieldt
Contributor

Potentially also relevant: we were power saving even when wifi was connected. That's fixed now: #5443

@koliha
Author

koliha commented Nov 27, 2024

Potentially also relevant: we were power saving even when wifi was connected. That's fixed now: #5443

Saw that but it's not related.

Static IP has kept it from falling off completely overnight, but it's disconnecting+reconnecting to wifi over and over and seems a little wonky overall (pretty sure it's not beaconing device metrics to MQTT consistently). 2 minutes of no ping responses, responds for 5-7 seconds, then back to no response.

I'll wait for a release that incorporates #5387 before I attempt to troubleshoot further.

@leshniak

I'm also experiencing the issue on 2.5.14.f2ee0df and latest beta. Waiting for the mentioned fixes.

@LowVoltagePirate

I've built the files from the master repo and can confirm that the issue seems fixed; my T-Beam has now been online for >48 hours without any issues.

@leshniak

I also decided to make my own build yesterday. So far so good: it maintains a stable WiFi connection, and heap usage stabilized at 90% after 24h.

image

@leshniak

leshniak commented Nov 29, 2024

Hmm, the same issue sometimes occurs while connecting to the node via the web interface and TCP. The device rebooted itself after a few minutes. No logs for now, unfortunately, due to the remote location.

I have ~70 nodes in my NodeDB, if that matters.

Edit: looks like all is fine if I connect to the node shortly after the reboot.

@CTassisF
Contributor

Meshtastic Firmware 2.5.15.79da236 Alpha was released with fixes for #5387.

@fifieldt
Contributor

Fixed by #5387

@Matzebhv

Matzebhv commented Dec 1, 2024

Had to reopen, Firmware 2.5.15.79da236 does not fix this issue. My 2 Supremes are disconnecting after a couple of hours.
Same behavior as described here #5458 (comment)

@commanderts

Same problem here with a Heltec V3...

@koliha
Author

koliha commented Dec 2, 2024

Also seeing the same behavior with my Station G2s on 2.5.15. I will try to pull logs later today.

@thebentern reopened this on Dec 2, 2024
@Xaositek

Xaositek commented Dec 2, 2024

I notice you're on a UniFi Network, here's my settings for you to compare against. I'm running UniFi Network 9.0.92, Firmware 6.7.9 and physical devices are two U6 Pros and a U6 In-Wall - I have not yet been able to reproduce any issues. I have a Heltec V3 and a T-Beam (Original) connected via WiFi, they run for weeks and zero issues; my Heltec V3 is on 2.5.15, flashed 3 days ago and no problems.

image

@koliha
Author

koliha commented Dec 2, 2024

I notice you're on a UniFi Network, here's my settings for you to compare against. I'm running UniFi Network 9.0.92, Firmware 6.7.9 and physical devices are two U6 Pros and a U6 In-Wall - I have not yet been able to reproduce any issues. I have a Heltec V3 and a T-Beam (Original) connected via WiFi, they run for weeks and zero issues; my Heltec V3 is on 2.5.15, flashed 3 days ago and no problems.

Thanks for the reply. I'm seeing this on two fairly complex networks w/Ubiquiti as well as a very simple cable modem + netgear nighthawk consumer router setup. Reportedly the issue does not occur if MQTT is disabled - do you have your node connected via mqtt?

I'm currently timing the issue, re-checking the behavior post-patch, and capturing logs. After that I'll try turning MQTT off and see if it makes any difference. It would be nice to narrow it down. I highly doubt it's actually something to do with the code around networking.

@Xaositek

Xaositek commented Dec 2, 2024

Yes, It is connected via MQTT to my local MQTT server, not public MQTT.

It is using a DNS name which will internally resolve to a local IP.

@Matzebhv

Matzebhv commented Dec 2, 2024

It has nothing to do with UniFi or anything like that. No special setup here: one node is connected to my Fritzbox and the other is connected at my workplace to an Extreme enterprise AP. LongFast is uplink only, and 3 channels with moderate traffic are set up with up/downlink. If I disable MQTT the node will stay online.

@garthvh
Member

garthvh commented Dec 2, 2024

It has nothing to do with UniFi or anything like that. No special setup here: one node is connected to my Fritzbox and the other is connected at my workplace to an Extreme enterprise AP. LongFast is uplink only, and 3 channels with moderate traffic are set up with up/downlink. If I disable MQTT the node will stay online.

If your mqtt is on the default topic that is likely too much traffic for the node to handle.

@leshniak

leshniak commented Dec 3, 2024

If your mqtt is on the default topic that is likely too much traffic for the node to handle.

Mine is connecting to a local mosquitto instance (not bridged) and behaves the same way.

@koliha
Author

koliha commented Dec 3, 2024

If your mqtt is on the default topic that is likely too much traffic for the node to handle.

Mine is connecting to a local mosquitto instance (not bridged) and behaves the same way.

Similar. One node is connecting to self-hosted mosquitto on LAN, the other is connecting to the same through the internet (NAT). Decently low traffic overall (way less than the official/dev server)

@leshniak

leshniak commented Dec 3, 2024

The attached log from mosquitto shows how unstable the connection is. Then I've triggered a reboot via LoRa and it's all gone.

I think this looks a bit weird:

1733217530: New connection from 192.168.5.85:55328 on port 1883.
1733217530: New client connected from 192.168.5.85:55328 as !2f93dc9c (p2, c1, k15).
1733217554: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217561: New connection from 192.168.5.85:55329 on port 1883.
1733217561: New client connected from 192.168.5.85:55329 as !2f93dc9c (p2, c1, k15).
1733217584: Client !2f93dc9c has exceeded timeout, disconnecting.
1733217589: New connection from 192.168.5.85:50628 on port 1883.
1733217589: New client connected from 192.168.5.85:50628 as !2f93dc9c (p2, c1, k15).
1733217678: New connection from 192.168.5.85:51534 on port 1883.
1733217678: Client !2f93dc9c already connected, closing old connection.
1733217678: New client connected from 192.168.5.85:51534 as !2f93dc9c (p2, c1, k15).

I have a Wemos D1 Mini placed in the exact same location, with ESPHome firmware and it's super stable despite having a ~10 dBm weaker WiFi signal.

log.txt
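
For anyone who wants to measure this pattern in their own broker log, here is a minimal Python sketch. It assumes the two mosquitto line shapes quoted above and reads the `k15` field as a 15 second keepalive; the function name and the sample list are hypothetical, and this is not part of the Meshtastic firmware or tooling.

```python
import re

# Patterns for the two mosquitto log line shapes quoted in the excerpt above.
# Assumption: the "kNN" field is the keepalive the client requested, in seconds.
CONNECT = re.compile(r"^(\d+): New client connected from \S+ as (\S+) \(p\d+, c\d+, k(\d+)\)")
TIMEOUT = re.compile(r"^(\d+): Client (\S+) has exceeded timeout, disconnecting")


def timed_out_sessions(lines):
    """Return (client_id, seconds_connected, keepalive) for each session
    that the broker ended with a keepalive timeout."""
    started = {}   # client_id -> (connect timestamp, keepalive)
    sessions = []
    for line in lines:
        m = CONNECT.match(line)
        if m:
            ts, client, keepalive = m.groups()
            started[client] = (int(ts), int(keepalive))
            continue
        m = TIMEOUT.match(line)
        if m and m.group(2) in started:
            t0, keepalive = started.pop(m.group(2))
            sessions.append((m.group(2), int(m.group(1)) - t0, keepalive))
    return sessions


# First six lines of the excerpt above, copied verbatim.
SAMPLE = [
    "1733217530: New connection from 192.168.5.85:55328 on port 1883.",
    "1733217530: New client connected from 192.168.5.85:55328 as !2f93dc9c (p2, c1, k15).",
    "1733217554: Client !2f93dc9c has exceeded timeout, disconnecting.",
    "1733217561: New connection from 192.168.5.85:55329 on port 1883.",
    "1733217561: New client connected from 192.168.5.85:55329 as !2f93dc9c (p2, c1, k15).",
    "1733217584: Client !2f93dc9c has exceeded timeout, disconnecting.",
]

for client, seconds, keepalive in timed_out_sessions(SAMPLE):
    print(f"{client}: dropped after {seconds}s (keepalive {keepalive}s)")
```

On the excerpt this reports sessions of 24 s and 23 s. If the keepalive reading is right, that is close to the 1.5x keepalive window (22.5 s) the MQTT spec gives a broker before it drops a silent client, which would suggest the broker heard nothing from the node after the initial connect.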

@garthvh
Member

garthvh commented Dec 3, 2024

Is the mqtt JSON functionality being used? Who is hosting the private brokers?

@koliha
Author

koliha commented Dec 3, 2024

Is the mqtt JSON functionality being used? Who is hosting the private brokers?

Not in any of my use cases. I'm hosting the mqtt server locally (ubuntu vm, mosquitto). Seeing it w/node on LAN as well as one that is connecting through NAT/internet.

@leshniak

leshniak commented Dec 4, 2024

Currently I'm testing the connection through WiFi repeater and with different network settings on both ends - maybe it doesn't like my access point and/or vice versa. I'll let you know about the results in next couple of days.

(testing also #5490 case)

@koliha
Author

koliha commented Dec 5, 2024

I just wanted to provide an update as it's been a couple of days. During my initial tests, both nodes died at 36 hours. Currently, both have been up for 65 hours. The only thing that is different is that I'm using PRTG to ping and scrape json data (http) every 60 seconds.

I'm going to turn off the ping/json traffic, reboot both nodes, and see if I can reproduce the original issue. I'll circle back here in ~48hr.
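
For those without PRTG, a rough stand-in for the reachability check can be sketched in Python. This is only an illustration: it probes a TCP port rather than sending ICMP pings or scraping JSON, port 80 is an assumption about where the node's web server listens, and the function names and the example address are hypothetical.

```python
import socket
import time


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outages(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage spans.

    start is the first failed sample of a run; end is the first successful
    sample after it, or None if the run is still ongoing at the last sample.
    """
    spans = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            spans.append((down_since, ts))
            down_since = None
    if down_since is not None:
        spans.append((down_since, None))
    return spans


def probe(host, port=80, interval=60.0, count=5):
    """Probe host:port `count` times, `interval` seconds apart."""
    samples = []
    for i in range(count):
        samples.append((time.time(), tcp_reachable(host, port)))
        if i < count - 1:
            time.sleep(interval)
    return samples


# Example (hypothetical address): sample a node for a day, then list outages.
# print(outages(probe("192.168.1.50", count=1440)))
```

Note that a probe like this generates regular traffic to the node, so it may change the behavior being measured in the same way the PRTG polling might have.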

@EnGamma

EnGamma commented Dec 7, 2024

I've been seeing this since updating fw from 2.3.10 to 2.5.11 on my 4 TLORA_V2_1_1P6 devices. Can't reliably connect from my phone via WiFi to the fixed IPs of the nodes. Sometimes switching between nodes and back connects, but often not. Sometimes switching between WiFi SSIDs (all same network) fixes it, but often not.

@koliha
Author

koliha commented Dec 9, 2024

Thank you all for your patience while I attempted to reproduce this. The original test nodes have been up for 4 days solid and the remote node (on a completely different "consumer" network setup) has been up and going for a little over 2 days now.

Subjectively, I feel that this issue can be closed out as resolved. I was able to reproduce this easily on 2.5.13/2.5.14. With 2.5.15 I had an MQTT/network drop-off initially (with two nodes simultaneously), but I haven't been able to reproduce the issue since. T1000-E stopped randomly rebooting itself with the fixes in 2.5.15 also.

It would be good to note on the GitHub releases page that this MQTT/network issue exists in 2.5.10 - 2.5.14 (apparently).

@Matzebhv

Matzebhv commented Dec 9, 2024

Subjectively, I feel that this issue can be closed out as resolved.
OK? What is the solution here?

@koliha
Author

koliha commented Dec 9, 2024

Subjectively, I feel that this issue can be closed out as resolved.
OK? What is the solution here?

Upgrade to 2.5.15.

@Matzebhv

Matzebhv commented Dec 9, 2024

Subjectively, I feel that this issue can be closed out as resolved.
OK? What is the solution here?

Upgrade to 2.5.15.

Really?? We are on .15. This problem is not solved. Please reopen, this really sucks.

@leshniak

The issue still exists. I've managed to make it more stable by adding an esp_wifi_repeater, but that's a workaround. My theory now is that something gets clogged (buffer? memory leak?) when TCP segments have to be retransmitted, because the same thing occurs while using the Web client.

@Matzebhv

Same as here -> #5549

@Matzebhv

What is really bad: you have a couple of people in this issue with this problem. One person says "it is fixed for me" and zapp -> closed.

@thebentern
Contributor

What is really bad: you have a couple of people in this issue with this problem. One person says "it is fixed for me" and zapp -> closed.

The original reporter of this issue said that their issue was fixed for them by upgrading. You have inserted yourself into this issue without any background context to whether or not you are even experiencing the same scenario, on the same hardware, or any reproduction steps. I think you should rework your approach here to be less combative and demanding.

@koliha
Author

koliha commented Dec 13, 2024

Just as a final update, both of the original nodes that had issues have been up for almost 8 days solid now. I've started upgrading remote nodes (~100mi away) with 2.5.15/.16 and have not had any issues with them either. Mixture of ESP32 and NRF hardware. I'm also tracking heap free on one of the two original problematic nodes and haven't seen anything that screams memory leak either.

Screenshot 2024-12-13 at 1 34 56 PM
Screenshot 2024-12-13 at 1 34 09 PM

Thanks for the assistance/patience on this one.

@koliha
Author

koliha commented Dec 16, 2024

Spoke too soon, not solved. Looks like a memory leak ( see #5549 ). #5559 was opened for the same reason it seems. Posted additional details/comments there since it's open and this one has been closed.

Screenshot 2024-12-16 at 11 01 01 AM
Screenshot 2024-12-16 at 11 01 13 AM
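
To put a number on a suspected leak rather than eyeballing a graph, the heap-free samples can be fitted with a straight line. The Python sketch below is a plain least-squares slope; it assumes you already have (unix_timestamp, free_heap_bytes) pairs from whatever monitoring you use, and the function name and sample values are hypothetical.

```python
def heap_trend_bytes_per_hour(samples):
    """Least-squares slope of free heap over time, in bytes per hour.

    samples: list of (unix_timestamp_seconds, free_heap_bytes).
    A persistently negative result over a long window is consistent with
    a leak; a single short window proves little.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    num = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return (num / den) * 3600.0


# Made-up samples for illustration: free heap shrinking by 1000 bytes per hour.
example = [(0, 100_000), (3600, 99_000), (7200, 98_000)]
print(round(heap_trend_bytes_per_hour(example)))
```

Dividing the current free heap by the magnitude of a negative slope gives a rough time-to-exhaustion that can be compared against the observed time between drop-offs.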

@xhci

xhci commented Dec 18, 2024

I also see this on a Station G2. I have the power supply set to cycle every x hours as a workaround.
