Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Messages often come through garbled as "�떗" since recent update #3452

Closed
Obinice opened this issue Mar 15, 2024 · 61 comments · Fixed by #3474
Closed

Messages often come through garbled as "�떗" since recent update #3452

Obinice opened this issue Mar 15, 2024 · 61 comments · Fixed by #3474

Comments

@Obinice
Copy link

Obinice commented Mar 15, 2024

Since updating to Android client 2.3.0, messages on LongFast sometimes come through garbled. In one particularly bad case, 8 out of 9 messages received over the course of an hour were garbled. The displayed text is always the same, regardless of the original message content, or the sender.

��떗

This was also a few days after applying firmware update 2.2.24 to my Heltec V3, so it is possible the issue resides there, however no iOS users I've spoken with thus far have noted any issues since the firmware update, and in the days between updating the firmware and the Android client this issue did not crop up. So, my first thought is that it could be a client issue (but this needs more investigation).

Attached are three screenshots from the Android app, one showing LongFast with two garbled messages, and the other two screenshots are from the debug screen showing the details of those two messages.

IMG_20240314_230748
IMG_20240314_230619
IMG_20240314_230642

A few other users have discussed the issue in the Meshtastic Discord, they might be able to provide more information, and if there's any more info I can provide just let me know. Thank you for all your hard work, it's greatly appreciated <3

@andrekir
Copy link
Member

is LongFast connected to MQTT?

the debug log shows the payload in bytes, so this is how it came in from the radio. can you get the serial logs from the Heltec as this comes through?

@Obinice
Copy link
Author

Obinice commented Mar 16, 2024

I don't have any public MQTT enabled.

I'm afraid my node is sealed up on a mast above the building currently, I won't have physical access to it again for another week or so, but I'll link other users who have been experiencing this to this issue, hopefully one of them can collect the serial data when one of these messages comes through. Thanks!

@Vanguard4893
Copy link

Can confirm this issue is affecting myself also - serial debug log from the radio below. The message field appears as ????

This issue seems to affect messages at random, and 5 messages from the same endpoint will show 1-2 corrupted.

DEBUG | 15:45:30 2679 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=26, time 428 ms DEBUG | 15:45:30 2679 [RadioIf] Lora RX (id=0xb299fbc0 fr=0xf0 to=0xff, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=6.25 rxRSSI=-77) DEBUG | 15:45:30 2679 [RadioIf] AirTime - Packet received : 428ms DEBUG | 15:45:30 2679 [Router] recentPackets size=96 DEBUG | 15:45:30 2679 [Router] recentPackets size=94 (after clearing expired packets) DEBUG | 15:45:30 2679 [Router] Add packet record (id=0xb299fbc0 fr=0xf0 to=0xff, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=6.25 rxRSSI=-77) DEBUG | 15:45:30 2679 [Router] recentPackets size=95 DEBUG | 15:45:30 2679 [Router] recentPackets size=95 (after clearing expired packets) DEBUG | 15:45:30 2679 [Router] Using channel 0 (hash 0x8) DEBUG | 15:45:30 2679 [Router] Expanding short PSK meshtastic/Meshtastic-Android#1 DEBUG | 15:45:30 2679 [Router] Using AES128 key! DEBUG | 15:45:30 2679 [Router] ESP32 crypt fr=da5678f0, num=b299fbc0, numBytes=10! DEBUG | 15:45:30 2679 [Router] decoded message (id=0xb299fbc0 fr=0xf0 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710603930 rxSNR=6.25 rxRSSI=-77) DEBUG | 15:45:30 2679 [Router] handleReceived(REMOTE) (id=0xb299fbc0 fr=0xf0 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710603930 rxSNR=6.25 rxRSSI=-77) DEBUG | 15:45:30 2679 [Router] Module 'text' wantsPacket=1 INFO | 15:45:30 2679 [Router] Received text msg from=0xda5678f0, id=0xb299fbc0, msg=????

@clwgh
Copy link

clwgh commented Mar 16, 2024

I noticed after flashing 2.3.0, with the Heltec V3 still connected, the terminal halted with some words corrupted with what I think were those or similar characters. It did this consistently and in the same way and at the same point when flashing 2.3.0 each time. Flashing 2.2.24 did not do this and the terminal would continue to display updating text after rebooting. Perhaps this is related. I cannot test for the exact symbols seen right now unfortunately.

Edit: the exact text is below. Possibly not related but mentioned in case useful and triggers something.

...
Writing at 0x3a6625... (93%)
Writing at 0x3aa620... (96%)
Writing at 0x3b1a4c... (100%)
Wrote 1048576 bytes (473054 compressed) at 0x300000 in 43.734 seconds.
Leaving...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x1 (POWERON),boot:0x29 (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3808,len:0x44c
load:0x403c9700,len:0xbe4
load:0x403cc700,len:0x2a38
entry 0x403c98d4
E (350) esp_core_dump_flash: No core dump partition found!
E (350) esp_core_dum����@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 0 Booted, wake cause 0 (boot count 1), reset_reason=reset
D

@Obinice
Copy link
Author

Obinice commented Mar 16, 2024

I realised I could pull my mobile node out to capture this for you @andrekir , I hope this snapshot is enough, you can see the garbled message received and also retransmitted:

DEBUG | 16:29:51 142 [Power] Battery: usbPower=0, isCharging=0, batMv=4047, batPct=89
DEBUG | 16:29:53 144 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=26, time 428 ms
DEBUG | 16:29:53 144 [RadioIf] Lora RX (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:53 144 [RadioIf] AirTime - Packet received : 428ms
DEBUG | 16:29:53 144 [Router] Add packet record (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x8 encrypted rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:53 144 [Router] Using channel 0 (hash 0x8)
DEBUG | 16:29:53 144 [Router] Expanding short PSK meshtastic/Meshtastic-Android#1
DEBUG | 16:29:53 144 [Router] Using AES128 key!
DEBUG | 16:29:53 144 [Router] ESP32 crypt fr=e66cf840, num=fdf4d7e, numBytes=10!
DEBUG | 16:29:53 144 [Router] decoded message (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710606593 rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:53 144 [Router] handleReceived(REMOTE) (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710606593 rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:54 144 [Router] Module 'text' wantsPacket=1
INFO  | 16:29:54 144 [Router] Received text msg from=0xe66cf840, id=0xfdf4d7e, msg=▒떗
DEBUG | 16:29:54 144 [Router] Enter state: ON
DEBUG | 16:29:54 144 [Router] showing standard frames
DEBUG | 16:29:54 144 [Router] Module wants a UI Frame
DEBUG | 16:29:54 144 [Router] Showing 1 module frames
DEBUG | 16:29:54 144 [Router] Total frame count: 104
DEBUG | 16:29:54 144 [Router] Added modules.  numframes: 1
DEBUG | 16:29:54 144 [Router] Finished building frames. numframes: 9
DEBUG | 16:29:54 144 [Router] Module 'text' considered
DEBUG | 16:29:54 144 [Router] Module 'canned' wantsPacket=1
DEBUG | 16:29:54 144 [Router] Module 'canned' considered
DEBUG | 16:29:54 144 [Router] Module 'StoreForward' wantsPacket=1
DEBUG | 16:29:54 144 [Router] Module 'StoreForward' considered
DEBUG | 16:29:54 144 [Router] Module 'ExternalNotificationModule' wantsPacket=1
INFO  | 16:29:54 145 [Router] External Notification Module Disabled
DEBUG | 16:29:54 145 [Router] Module 'ExternalNotificationModule' considered
DEBUG | 16:29:54 145 [Router] Module 'routing' wantsPacket=1
INFO  | 16:29:54 145 [Router] Received routing from=0xe66cf840, id=0xfdf4d7e, portnum=1, payloadlen=6
DEBUG | 16:29:54 145 [Router] Routing sniffing (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710606593 rxSNR=5.25 rxRSSI=-76)
INFO  | 16:29:54 145 [Router] Rebroadcasting received floodmsg to neighbors
DEBUG | 16:29:54 145 [Router] Original length - 6
DEBUG | 16:29:54 145 [Router] Compressed length - 10
DEBUG | 16:29:54 145 [Router] Original message - ▒떗
DEBUG | 16:29:54 145 [Router] Not using compressing message.
DEBUG | 16:29:54 145 [Router] Expanding short PSK meshtastic/Meshtastic-Android#1
DEBUG | 16:29:54 145 [Router] Using AES128 key!
DEBUG | 16:29:54 145 [Router] ESP32 crypt fr=e66cf840, num=fdf4d7e, numBytes=10!
DEBUG | 16:29:54 145 [Router] enqueuing for send (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=0 Ch=0x8 encrypted rxtime=1710606593 rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:54 145 [Router] txGood=9,rxGood=15,rxBad=0
DEBUG | 16:29:54 145 [Router] rx_snr found. hop_limit:0 rx_snr:5.250000
DEBUG | 16:29:54 145 [Router] rx_snr found in packet. Setting tx delay:3024
DEBUG | 16:29:54 145 [Router] Delivering rx packet (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710606593 rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:54 145 [Router] Update DB node 0xe66cf840, rx_time=1710606593
DEBUG | 16:29:54 145 [Router] Forwarding to phone (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710606593 rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:54 145 [Router] Module 'routing' considered
INFO  | 16:29:54 145 Telling client we have new packets 19
INFO  | 16:29:54 145 [ServerAPI] getFromRadio=STATE_SEND_PACKETS
DEBUG | 16:29:54 145 [ServerAPI] phone downloaded packet (id=0x0fdf4d7e fr=0x40 to=0xff, WantAck=0, HopLim=1 Ch=0x0 Portnum=1 rxtime=1710606593 rxSNR=5.25 rxRSSI=-76)
DEBUG | 16:29:54 145 [ServerAPI] encoding toPhone packet to phone variant=2, 52 bytes
DEBUG | 16:29:55 145 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=71, time 763 ms

@asjmcguire
Copy link

asjmcguire commented Mar 16, 2024

I'll just add my 2p here, I don't think it's Android that is the problem.... it's a firmware issue - and we brought it up in the firmware discord yesterday. I'd also add that perhaps the hopStart data should be added to the debugging output....

@b8b8
Copy link

b8b8 commented Mar 16, 2024

ive had the same results as above. Down grading my firmware to 2.2.24 but keeping the app at 2.3 prevents me from seeing any garbled messages as well as receiving more periodic telemetry than before. Before i would go hours or days between updates of telemetry from remote nodes while on 2.3 firmware.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 16, 2024

@Obinice Do you know if the message originated from the Android app as well?

So far I couldn't reproduce the issue sending between 2.2.24 and 2.3 devices, as well as from 2.2.24 --> 2.3.0 --> 2.2.24 or 2.2.24 --> 2.3.0 --> 2.3.0. This is when sending from either iOS or CLI (currently don't have an Android device with me). I've tried different kind of messages like DMs, traceroute and position request.

I'd also add that perhaps the hopStart data should be added to the debugging output....

It will be shown, but only if all nodes in the path are on 2.3.0.

@prokrypt
Copy link
Contributor

prokrypt commented Mar 16, 2024

@GUVWAF

It will be shown, but only if all nodes in the path are on 2.3.0.

But what if a node in the path has a different firmware version? What happens in these cases? Could it be that a repeating node is introducing some garbly garbage, chopping off some bits, or some other quirk?
2.2.24 (source) -> 2.3.0 (repeater) -> 2.2.24 (destination)
and
2.3.0 (source) -> 2.2.24 (repeater) -> 2.3.0 (destination)

@asjmcguire
Copy link

asjmcguire commented Mar 16, 2024

It will be shown, but only if all nodes in the path are on 2.3.0.

Not on line 271 though, which is clearly where the vast majority of debugging output seen above is originating.
Why do all nodes in the path have to be on 2.3.0 - I thought the point of HopStart in the header was that as long as the sender and receiver was on 2.3.0, the nodes in between didn't matter too much what firmware they were running.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 16, 2024

Over LoRa, three previously unused bits in the header are now used for hopStart. Internally when receiving the firmware will parse these bits and put them in a protobuf field hopStart, which is new, but added at the end of the MeshPacket to ensure backwards compatibility. This is also how the packet is sent to a client.
Then when it rebroadcasts it, it does the reverse operation: it puts the value of the hopStart field in the header.

Nodes on older firmware won't use those bits and do not use/know the hopStart protobuf field.

@Obinice
Copy link
Author

Obinice commented Mar 17, 2024

I'll just add my 2p here, I don't think it's Android that is the problem.... it's a firmware issue - and we brought it up in the firmware discord yesterday. I'd also add that perhaps the hopStart data should be added to the debugging output....

@asjmcguire Agreed, at the time I suspected the app, but now I'm not so sure. If I knew how to move this to the appropriate location I would 😅

Do you know if the message originated from the Android app as well?

@GUVWAF The messages I received via the test I debug logged were sent via Android client, the information I was provided by the sender was "V2.3.0 app, v2.3.1 alpha node, and v2.2.24 node" , however the others that I've seen in the wild I'm afraid I have no clue, sorry.

I'm running firmware 2.2.24 on my base node (which will have been the one relaying the message to my mobile node, which I used for the debug log). My mobile node, which I only used to receive those test messages for debug yesterday, is on 2.2.23.

In terms of being the recipient, I've since tried a few different versions of the Android client, the Play Store beta as well as the Play Store and F-Droid store stable releases, but the problem is present regardless. I've also been doing some private MQTT logging from the base node directly to mosquitto on a local server (only in the past day - can't have been affecting this issue), and the messages are logged there as follows:

{"channel":0,"from":3865901120,"id":266294490,"payload":{"text":"\u0000\u0000\u0000\u0000\u0000\u0000"},"rssi":-99,"sender":"!e2e37908","snr":3.25,"timestamp":1710664304,"to":4294967295,"type":"text"}

(The above is just a random example of one of these garbled messages, received this morning.

On occasion, the garbled message received seems to be very slightly different, it contains a quotation mark at the end. I know MQTT JSON isn't the best format to show this, but here's an example of one I received yesterday, where the actual visible output in the client looked something like ��떗"

{"channel":0,"from":3664101024,"id":2347968366,"payload":{"text":"\u0000\u0000\u0000\u0000\u0000\u0000\""},"rssi":-97,"sender":"!e2e37908","snr":-0.5,"timestamp":1710599894,"to":4294967295,"type":"text"}

This is the only difference in the pattern that I've seen.

ive had the same results as above. Down grading my firmware to 2.2.24 but keeping the app at 2.3 prevents me from seeing any garbled messages

@b8b8 Unfortunately I'm on 2.2.24 myself, and am seeing the issue on app 2.3 as well as the previous stable version. I'm also seeing it on my 2.2.23 mobile node - it seems likely that the corruption is being generated somewhere on the mesh, perhaps at the origin, and is simply being passed along.

@Obinice
Copy link
Author

Obinice commented Mar 17, 2024

One small update, I just happened to have the web UI open when I received a garbled message, and the text displayed is very slightly different. I don't know if that will help troubleshooting, but for completeness here it is:

��떗�

image

@asjmcguire
Copy link

There are only 2 real changes in the 2.3.0 firmware that have changed anything to do with message sending right @GUVWAF ? One of the introduction of the hop_start header and the other is dynamic changing of the hoplimit based on the calculation of how many hops away the response needs to be sent. All the other fixes in this firmware have been about precision for GPS, general GPS fixes and power saving fixes. So in my opinion it has to be something introduced with the 2 changes that deal with forming the actual mesh packet.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 17, 2024

Yes, it almost certainly has to do with the addition of hopStart to the MeshPacket. Changing the hop limit on a response doesn't impact a normal text message.

It's weird that only the payload seems affected, from/to, hopLimit, SNR, etc. all look OK.

I don't have time currently to test more thoroughly, but can look into it this week, then also with Android app. Would be good to know if it is already garbled in the logs when the packet arrives from the phone at the original sender.

@garthvh
Copy link
Member

garthvh commented Mar 17, 2024

Somebody would have reported it for iOS if it is in the transport.

@asjmcguire
Copy link

The problem is (in the UK at least) many people haven't updated to 2.3.0 because it's labelled alpha and thus not stable, so there aren't enough 2.3.0 nodes "in the wild" to get better debugging. When the guys who were trying to track this down yesterday are back around - we will get them to log sending as well receiving and see if we can find a common factor.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 18, 2024

I'm still unable to reproduce.
I tried with the Android app version 2.2.23 and also version 2.3.0, both at the transmitter and receiver side, in combination with a mix of firmware 2.2.24 and 2.3.0, also with a hop in between. Tried all kind of different text with various characters/symbols and length, and also traceroute. I've never seen garbled text; not in the Android app, iOS app or serial logs - and also traceroute always works.
This was using a RAK4631, T-Beam and Heltec Wireless Paper. Maybe it's hardware specific?

@eldoctorcool
Copy link

On the Traceroute matter, I am using Lora32 1.6 boards, and TBeams 1.1.

@GiuStel
Copy link

GiuStel commented Mar 18, 2024

I noticed after flashing 2.3.0, with the Heltec V3 still connected, the terminal halted with some words corrupted with what I think were those or similar characters. It did this consistently and in the same way and at the same point when flashing 2.3.0 each time. Flashing 2.2.24 did not do this and the terminal would continue to display updating text after rebooting. Perhaps this is related. I cannot test for the exact symbols seen right now unfortunately.

Edit: the exact text is below. Possibly not related but mentioned in case useful and triggers something.

...
Writing at 0x3a6625... (93%)
Writing at 0x3aa620... (96%)
Writing at 0x3b1a4c... (100%)
Wrote 1048576 bytes (473054 compressed) at 0x300000 in 43.734 seconds.
Leaving...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x1 (POWERON),boot:0x29 (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3808,len:0x44c
load:0x403c9700,len:0xbe4
load:0x403cc700,len:0x2a38
entry 0x403c98d4
E (350) esp_core_dump_flash: No core dump partition found!
E (350) esp_core_dum����@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 0 Booted, wake cause 0 (boot count 1), reset_reason=reset
D

Same happened here at a friend's node:

ce1d29a9-de70-49a6-8ad1-95038fa5b473

@GiuStel
Copy link

GiuStel commented Mar 18, 2024

Here, on a near to 1 hundred nodes mesh, the switch to 2.3.0 is being a hell.
We've not been able yet to realize what breaks what but most of the messages are unreadable, traceroute works but reports one hop only (most of the times), nodes coming from an MQTT branch never show up in nodes list unless you already had them into it. Receiving telemetries seems to be very hard too.
I'm available for any test that might be useful.

Thanks

a791ddc0-5479-4521-9de9-4ffeb63c4cde

@GUVWAF
Copy link
Member

GUVWAF commented Mar 18, 2024

MQTT nodes using 2.3.0 will not be shown for nodes with older firmware and vice versa; this is intended and listed in the release notes.

It would be good to collect data about nodes that send out garbled text, especially when they already show this in the serial logs after receiving it from the phone. Would be good to know what the content of the actual message was and which hardware and client (app version) they are using.

@eldoctorcool
Copy link

Wow, 100 nodes... and I have about 10 nodes max, ... at first when I saw this I thought radio was out of range... the failed traceroutes, but RSSI was well within good range. What I've rolled back to 2.2.23, as 2.2.24 also seems to have GPS issues as well... what used to be 75 meters in SmartPosition, now I had to make it <15 meters to match what used to be 75 meters, otherwise it will take about 1 km to report moving things on another client...

@dantuf
Copy link

dantuf commented Mar 19, 2024

Maybe this is a clue. I get a lot of these messages in the log on my inside node when traffic is forwarded by my outside node on the roof running 2.3.1 firmware (heltec v3).

ERROR | 18:36:36 262 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15aca8
ERROR | 18:36:36 262 [Router] Error decoding protobuf module!
ERROR | 18:36:36 262 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15aca8
ERROR | 18:36:36 262 [Router] Error decoding protobuf module!
DEBUG | 18:36:36 262 [Router] Module 'nodeinfo' considered
DEBUG | 18:36:36 262 [Router] Module 'nodeinfo' handled and skipped other processing
ERROR | 18:36:47 272 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15b3d0
ERROR | 18:36:47 272 [Router] Error decoding protobuf module!
ERROR | 18:36:47 272 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15b3d0
ERROR | 18:36:47 272 [Router] Error decoding protobuf module!
DEBUG | 18:36:47 272 [Router] Module 'DeviceTelemetry' considered
DEBUG | 18:36:47 272 [Router] Module 'DeviceTelemetry' handled and skipped other processing

It seems like the packets arrive but most often than not the payload is bad. I guess in the case of text messages the payload is passed on directly showing the garbled contents.

In my case the inside node don't hear distant nodes so more or less all traffic passes by the roof node which makes the problem very apparent. Enabling MQTT on the roof node to a local broker shows the packets arriving just fine. I believe it's just the packet forwarding that somehow corrupt the payload.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 19, 2024

@dantuf Do you have the logs just before that? I want to know if the payload length makes sense.

Also, which firmware version is your inside node using and what hardware is it? And do you know the firmware and hardware type of the device that originally sent this packet?

@dantuf
Copy link

dantuf commented Mar 19, 2024

Okay, here we go. The inside node showing this log is a heltec v3 running 2.3.1

DEBUG | 19:18:48 2793 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=26, time 428 ms
DEBUG | 19:18:48 2793 [RadioIf] Lora RX (id=0x1b72d8e9 fr=0x60 to=0xff, WantAck=0, HopLim=2 Ch=0x8 encrypted rxSNR=7 rxRSSI=-68 hopStart=3)
DEBUG | 19:18:48 2793 [RadioIf] AirTime - Packet received : 428ms
DEBUG | 19:18:48 2794 [Router] Add packet record (id=0x1b72d8e9 fr=0x60 to=0xff, WantAck=0, HopLim=2 Ch=0x8 encrypted rxSNR=7 rxRSSI=-68 hopStart=3)
DEBUG | 19:18:48 2794 [Router] Using channel 0 (hash 0x8)
DEBUG | 19:18:48 2794 [Router] Expanding short PSK meshtastic/Meshtastic-Android#1
DEBUG | 19:18:48 2794 [Router] Using AES128 key!
DEBUG | 19:18:48 2794 [Router] ESP32 crypt fr=75e19a60, num=1b72d8e9, numBytes=10!
DEBUG | 19:18:48 2794 [Router] decoded message (id=0x1b72d8e9 fr=0x60 to=0xff, WantAck=0, HopLim=2 Ch=0x0 Portnum=4 rxtime=1710875928 rxSNR=7 rxRSSI=-68 hopStart=3)
DEBUG | 19:18:48 2794 [Router] handleReceived(REMOTE) (id=0x1b72d8e9 fr=0x60 to=0xff, WantAck=0, HopLim=2 Ch=0x0 Portnum=4 rxtime=1710875928 rxSNR=7 rxRSSI=-68 hopStart=3)
DEBUG | 19:18:48 2794 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 19:18:49 2794 [Router] Received nodeinfo from=0x75e19a60, id=0x1b72d8e9, portnum=4, payloadlen=6
ERROR | 19:18:49 2794 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15aca8
ERROR | 19:18:49 2794 [Router] Error decoding protobuf module!
ERROR | 19:18:49 2794 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15aca8
ERROR | 19:18:49 2794 [Router] Error decoding protobuf module!
DEBUG | 19:18:49 2794 [Router] Module 'nodeinfo' considered
DEBUG | 19:18:49 2794 [Router] Module 'nodeinfo' handled and skipped other processing
DEBUG | 19:18:50 2795 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=26, time 428 ms
DEBUG | 19:18:50 2795 [RadioIf] Lora RX (id=0x5a4c466f fr=0xe8 to=0xff, WantAck=0, HopLim=0 Ch=0x8 encrypted rxSNR=6.75 rxRSSI=-69)
DEBUG | 19:18:50 2795 [RadioIf] AirTime - Packet received : 428ms
DEBUG | 19:18:50 2795 [Router] Add packet record (id=0x5a4c466f fr=0xe8 to=0xff, WantAck=0, HopLim=0 Ch=0x8 encrypted rxSNR=6.75 rxRSSI=-69)
DEBUG | 19:18:50 2795 [Router] Using channel 0 (hash 0x8)
DEBUG | 19:18:50 2795 [Router] Expanding short PSK meshtastic/Meshtastic-Android#1
DEBUG | 19:18:50 2795 [Router] Using AES128 key!
DEBUG | 19:18:50 2795 [Router] ESP32 crypt fr=e2e2e1e8, num=5a4c466f, numBytes=10!
DEBUG | 19:18:50 2795 [Router] decoded message (id=0x5a4c466f fr=0xe8 to=0xff, WantAck=0, HopLim=0 Ch=0x0 Portnum=66 rxtime=1710875930 rxSNR=6.75 rxRSSI=-69)
DEBUG | 19:18:50 2795 [Router] handleReceived(REMOTE) (id=0x5a4c466f fr=0xe8 to=0xff, WantAck=0, HopLim=0 Ch=0x0 Portnum=66 rxtime=1710875930 rxSNR=6.75 rxRSSI=-69)
DEBUG | 19:18:50 2795 [Router] Module 'routing' wantsPacket=1
INFO  | 19:18:50 2795 [Router] Received routing from=0xe2e2e1e8, id=0x5a4c466f, portnum=66, payloadlen=6
DEBUG | 19:18:50 2795 [Router] Routing sniffing (id=0x5a4c466f fr=0xe8 to=0xff, WantAck=0, HopLim=0 Ch=0x0 Portnum=66 rxtime=1710875930 rxSNR=6.75 rxRSSI=-69)
DEBUG | 19:18:50 2795 [Router] Delivering rx packet (id=0x5a4c466f fr=0xe8 to=0xff, WantAck=0, HopLim=0 Ch=0x0 Portnum=66 rxtime=1710875930 rxSNR=6.75 rxRSSI=-69)
DEBUG | 19:18:50 2795 [Router] Update DB node 0xe2e2e1e8, rx_time=1710875930
INFO  | 19:18:50 2795 [Router] Heard a node on channel 0 we don't know, sending NodeInfo and asking for a response.
DEBUG | 19:18:50 2795 [Router] cancelSending id=0x15657d5b, removed=0
DEBUG | 19:18:50 2795 [Router] Sending NodeInfo will be ignored since we just sent it.
DEBUG | 19:18:50 2795 [Router] Forwarding to phone (id=0x5a4c466f fr=0xe8 to=0xff, WantAck=0, HopLim=0 Ch=0x0 Portnum=66 rxtime=1710875930 rxSNR=6.75 rxRSSI=-69)
DEBUG | 19:18:50 2795 [Router] Module 'routing' considered
DEBUG | 19:18:57 2802 [RadioIf] (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=31, time 468 ms
DEBUG | 19:18:57 2802 [RadioIf] Lora RX (id=0x4f35ffdd fr=0x36 to=0xdc, WantAck=0, HopLim=2 Ch=0x8 encrypted rxSNR=6.25 rxRSSI=-68)
DEBUG | 19:18:57 2802 [RadioIf] AirTime - Packet received : 468ms
DEBUG | 19:18:57 2803 [Router] Add packet record (id=0x4f35ffdd fr=0x36 to=0xdc, WantAck=0, HopLim=2 Ch=0x8 encrypted rxSNR=6.25 rxRSSI=-68)
DEBUG | 19:18:57 2803 [Router] Using channel 0 (hash 0x8)
DEBUG | 19:18:57 2803 [Router] Expanding short PSK meshtastic/Meshtastic-Android#1
DEBUG | 19:18:57 2803 [Router] Using AES128 key!
DEBUG | 19:18:57 2803 [Router] ESP32 crypt fr=2d786336, num=4f35ffdd, numBytes=15!
DEBUG | 19:18:57 2803 [Router] decoded message (id=0x4f35ffdd fr=0x36 to=0xdc, WantAck=0, HopLim=2 Ch=0x0 Portnum=4 requestId=5bff2716 rxtime=1710875937 rxSNR=6.25 rxRSSI=-68)
DEBUG | 19:18:57 2803 [Router] handleReceived(REMOTE) (id=0x4f35ffdd fr=0x36 to=0xdc, WantAck=0, HopLim=2 Ch=0x0 Portnum=4 requestId=5bff2716 rxtime=1710875937 rxSNR=6.25 rxRSSI=-68)
DEBUG | 19:18:57 2803 [Router] Module 'nodeinfo' wantsPacket=1
INFO  | 19:18:57 2803 [Router] Received nodeinfo from=0x2d786336, id=0x4f35ffdd, portnum=4, payloadlen=6
ERROR | 19:18:57 2803 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15aca8
ERROR | 19:18:58 2803 [Router] Error decoding protobuf module!
ERROR | 19:18:58 2803 [Router] Can't decode protobuf reason='wrong wire type', pb_msgdesc 0x3c15aca8
ERROR | 19:18:58 2803 [Router] Error decoding protobuf module!
DEBUG | 19:18:58 2803 [Router] Module 'nodeinfo' considered
DEBUG | 19:18:58 2803 [Router] Module 'nodeinfo' handled and skipped other processing

0x75e19a60 = TLORA_T3_S3
0xe2e2e1e8 = HELTEC_V3
0x2d786336 = RAK4631

Not sure about the firmware on these nodes since they are not mine.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 19, 2024

Thanks for that. Payload length of 6 bytes is way too short for NodeInfo, not sure why this is incorrect while the rest of the info from the packet looks OK.

@asjmcguire
Copy link

It's a payload len of 6 further up too - on the logs @Obinice posted.
as well the very confusing part of:

DEBUG | 16:29:54 145 [Router] Original length - 6
DEBUG | 16:29:54 145 [Router] Compressed length - 10
DEBUG | 16:29:54 145 [Router] Original message - ▒떗

Which leaves me scratching my head as to how the compressed length is longer than the uncompressed length....

@GUVWAF
Copy link
Member

GUVWAF commented Mar 19, 2024

Indeed.

Which leaves me scratching my head as to how the compressed length is longer than the uncompressed length....

This can happen with the library it's using, but in that case it won't use the compressed version, as the next line shows. Actually compression is not working currently but that will be fixed in 3.0 and is unrelated to this issue.

@andrekir andrekir transferred this issue from meshtastic/Meshtastic-Android Mar 21, 2024
@geeksville
Copy link
Member

This issue has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/2-3-1-traceroute-and-remote-radio-config-broken/11275/4

@GiuStel
Copy link

GiuStel commented Mar 21, 2024

As an additional element - I don't know if useful - I can tell that most of the nodes here have channel 1, 2 and 3 configured with a PSK.
Channel 0 is MediumFast with standard PSK.
Rolling back (tens of nodes) to version 2.2.24 is taking us back to the normal mesh behaviour.
We're availble for tests if needed. Thanks.

@grahamrobertslx
Copy link

grahamrobertslx commented Mar 21, 2024

So down graded all my nodes to 2.2.24 and still having the same issue. one node on web and the other on android 2.2.24.
Also...
iOS app to node on web app web app receives the ???? but iOS app says the message had an issue "Implicit ACK from another node". Message sent from web to iOS and the message isn't received.
Messages between iOS and Android work fine.

All with direct trace routes

@GiuStel
Copy link

GiuStel commented Mar 21, 2024

So down graded all my nodes to 2.2.24 and still having the same issue. one node on web and the other on android 2.2.24. Also... iOS app to node on web app web app receives the ???? but iOS app says the message had an issue "Implicit ACK from another node". Message sent from web to iOS and the message isn't received. Messages between iOS and Android work fine.

All with direct trace routes

can you tell that no other nodes with 2.3.x were between yours?

@grahamrobertslx
Copy link

So down graded all my nodes to 2.2.24 and still having the same issue. one node on web and the other on android 2.2.24. Also... iOS app to node on web app web app receives the ???? but iOS app says the message had an issue "Implicit ACK from another node". Message sent from web to iOS and the message isn't received. Messages between iOS and Android work fine.
All with direct trace routes

can you tell that no other nodes with 2.3.x were between yours?

They all trace route Direct but can't tell if there was a repeater in-between as i don't believe these are shown in the Trace Route as a hop

@GUVWAF
Copy link
Member

GUVWAF commented Mar 21, 2024

can you tell that no other nodes with 2.3.x were between yours?

Yes, if there were 2.3.x nodes in between they might not show up in the Traceroute if they have decoding errors.

Just a heads-up: I asked someone that could reproduce it with two nodes (one 2.2.24 and one 2.3.0) to try this branch, which is a modified version of 2.3.0 without the suspicious change and that seems to resolve it. So we know that the addition of hopStart to the MeshPacket causes this, but we don't know why. If we can't resolve it quickly, we'll revert this change to get a new beta out soon.

@grahamrobertslx
Copy link

Ace, if i can be of any more help just let me know. Also happy to test out the modified version of 2.3.0 if you need more experiences

can you tell that no other nodes with 2.3.x were between yours?

Yes, if there were 2.3.x nodes in between they might not show up in the Traceroute if they have decoding errors.

Just a heads-up: I asked someone that could reproduce it with two nodes (one 2.2.24 and one 2.3.0) to try this branch, which is a modified version of 2.3.0 without the suspicious change and that seems to resolve it. So we know that the addition of hopStart to the MeshPacket causes this, but we don't know why. If we can't resolve it quickly, we'll revert this change to get a new beta out soon.

Ace, if i can be of any more help just let me know. Also happy to test out the modified version of 2.3.0 if you need more experiences

@asjmcguire
Copy link

Just a heads-up: I asked someone that could reproduce it with two nodes (one 2.2.24 and one 2.3.0) to try this branch, which is a modified version of 2.3.0 without the suspicious change and that seems to resolve it. So we know that the addition of hopStart to the MeshPacket causes this, but we don't know why. If we can't resolve it quickly, we'll revert this change to get a new beta out soon.

I hope we don't have to lose HopStart and HopsAway functionality, but while I was taking a look at the revert - I noticed that via_mqtt has been deleted from NodeDB, is this meant? @GUVWAF

@GUVWAF
Copy link
Member

GUVWAF commented Mar 21, 2024

@asjmcguire Ah, I see. No, that wasn't meant. If we revert it we'll ensure it's done properly, this was just for testing.

@erayd
Copy link

erayd commented Mar 22, 2024

I'm not familiar with this codebase, but the following stood out to me:

Over LoRa, three previously unused bits in the header are now used for hopStart. Internally when receiving the firmware will parse these bits and put them in a protobuf field hopStart, which is new, but added at the end of the MeshPacket to ensure backwards compatibility.

Looking at mesh.proto and mesh.options, MeshPacket.hop_start is defined as 8 bits, not 3. Is it possible there's an overrun somewhere due to this difference? Where in the source does the conversion between 3 & 8 bits occur?

@GUVWAF
Copy link
Member

GUVWAF commented Mar 22, 2024

@erayd Thanks for thinking along.
Here it is put into the header flags (which is 8 bits in total):

h->flags |= (p->hop_start << PACKET_FLAGS_HOP_START_SHIFT) & PACKET_FLAGS_HOP_START_MASK;

Using:
#define PACKET_FLAGS_HOP_START_MASK 0xE0
#define PACKET_FLAGS_HOP_START_SHIFT 5

And here it is parsed from the header flags and stored in the struct:
mp->hop_start = (h->flags & PACKET_FLAGS_HOP_START_MASK) >> PACKET_FLAGS_HOP_START_SHIFT;

The strange thing is that the decoding errors only happen for the payload inside the Data message in the MeshPacket. This payload itself is again a protobuf message (or just a string for a text message) of a specific type given by the port number, but the length of this payload looks somehow incorrect.

@erayd
Copy link

erayd commented Mar 22, 2024

Thanks @GUVWAF - seems it wasn't what I was wondering; that looks sane enough to me 🙂

Will keep looking...

@asjmcguire
Copy link

Potentially what is needed is for a dev to add more verbose logging to 2.2.24 that spits out exactly (raw) what is received over the radio before any processing is done to it. So you can verify that what is received is what should be received. Then some logging to show what is being put out on the radio (again raw).

Then send a message from a 2.3.X node to another 2.3.X node, but with the destination node switched off, ensuring that the 2.2.24 node HAS to rebroadcast.

Then you will at least know if what is arriving at the 2.2.24 node is intact, that way we can at least rule out that 2.3.X is malforming something. That allows us to concentrate on whether it is getting malformed by 2.3.X or whether 2.2.24 is responsible for doing something strange with it, and whether that is during the decoding phase, or whether it is when rebuilding the packet for retransmission.

@eureekasigns
Copy link

This issue has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/2-3-1-traceroute-and-remote-radio-config-broken/11275/4

Yes I'll carry on here vs the main forum as this is a very specific and seemingly sporadic issue.

I did update the app and could trace between the two nodes as well as use admin channel. Otherwise the main one would not do a direct trace when right next to both.

From a distance, it showed direct.

Downgraded one node and can trace with app store version of app.

Of note, the remote admin channel radio config also failed with the non alpha app.

On hardware, both are RAK boards, 19007. One is older micro USB, the other is C. I have not seen a garbled message but have not tried many. Can connect both nodes to gather console output if needed from a PC but it sounds like what others have found is minimal since the output may need other debug flags enabled to see more raw data?

Only other thing I have going is multiple channels with longfast as a later one in the list. If adding channels is part of the issue, perhaps that is a factor, I do not know.

If there is something I can do to test further, would be glad to try today. I do not know if I can force it through another hop with 2.2.x or whatever is in the area unless there's some way to force it to use a neighboring node.

@eureekasigns
Copy link

eureekasigns commented Mar 22, 2024

I don't know if this is relevant, but reading the release notes I noticed that the following was updated.
Could this relate at all to the issue between versions of nodes (and/or potentially app version)?

#3393
0333eb9

Notably the following parts:
platform = platformio/nordicnrf52@^10.4.0

Among changes there, the following is of interest.
https://github.com/platformio/platform-nordicnrf52/releases/tag/v10.2.0
which shows: arduino/ArduinoCore-mbed#710
Which contains:
Add ISGR Root X1 (Let's Encrypt) to CAs
Add instructions to manage certificates.h and cacert.pem

...So, if there is a difference in certificates and/or management of them, could this possibly explain the "garble" that is seen in some cases?

It is also of interest that in some cases, it is not known whether or not a message is going to the same "channel" (if unknown), correct? In the case of encrypted message, it may be passing through a "last known" node on what it believes should be "primary" ... but landing on a node's 3rd channel, for example. I am still trying to understand that behavior from a post on the discourse board -

https://meshtastic.discourse.group/t/question-about-private-messages-with-multiple-channels/11092

"It will use the channel where it most recently heard a Node Info packet from the destination on. That’s most likely the destination’s primary, but can be a secondary."

This may all be unrelated but seemed like a possibility after looking through the above and thought it worth mention.

@thebentern
Copy link
Contributor

Certificates don't have anything to do with LoRA, and that change happened in 2.3.1. The issue manifests in 2.3.0 as well.

@asjmcguire
Copy link

asjmcguire commented Mar 22, 2024

The issue manifests in 2.3.0 as well.

And apparently goes away when hopStart is removed from the mesh packet, which is depressing, because Hops Away is the most exciting feature in a long while.

@thebentern
Copy link
Contributor

And apparently goes away when hopStart is removed from the mesh packet, which is depressing, because Hops Away is the most exciting feature in a long while.

I agree. This is why we're digging in to see if it's something we can pinpoint with the decoding / encoding and not just yanking the change yet.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 22, 2024

Potentially what is needed is for a dev to add more verbose logging to 2.2.24 that spits out exactly (raw) what is received over the radio before any processing is done to it.

That's indeed a good idea. Since I still cannot reproduce it (tried other devices as well) I created branches for both 2.2.24 and 2.3.0 with debug messages containing the raw bytes. Hopefully the artifacts with compiled binaries will be available when the CI finishes here for 2.2.24 and here for 2.3.0, otherwise you'll also need to compile it yourself if you want to test.

@GUVWAF
Copy link
Member

GUVWAF commented Mar 23, 2024

Good news: I could reproduce and found that the root cause is the NeighborInfo Module. This is because with hopStart we can also determine who are neighbors for any other packet. I got full logs from another user and found that the NeighborInfo module was altering the packet even if it wasn’t a NeighborInfo packet. Even when that resulted into decoding errors, it would overwrite the content with garbage. It can only be reproduced if a node with NeighborInfo module enabled on 2.3.0 or 2.3.1 is in between or when testing with two devices on 2.3.0/2.3.1 if both have them enabled.

It is fixed by #3474. For now if you use 2.3.0 or 2.3.1, please don’t enable the NeighborInfo module.

Many thanks to everyone for providing logs or helping in other ways!
I’m sorry for the inconvenience but happy that we don’t have to revert the hopStart addition to the MeshPacket.

@dantuf
Copy link

dantuf commented Mar 23, 2024

Very good news indeed! Nice work! I can confirm your fix appears to solve the problem.

@asjmcguire
Copy link

That'll explain why I never saw the problem, I turned NeighbourInfo module off as soon as I updated to the first 2.3.0 alpha, on the basis that any packet with a HopsAway of 0, is clearly a neighbour, thus the NeighbourInfo module is essentially redundant right?

@GUVWAF
Copy link
Member

GUVWAF commented Mar 23, 2024

on the basis that any packet with a HopsAway of 0, is clearly a neighbour, thus the NeighbourInfo module is essentially redundant right?

Yes and no. Only if all nodes are using >=2.3.0, this can be guaranteed. Furthermore, if you enable the NeighborInfo module also on nodes that you don't connect to with your phone, you receive their NeighborInfo packets, so you know which nodes are their neighbors. When collecting these packets e.g. via MQTT, you can draw a graph of how all nodes are connected.

thebentern pushed a commit that referenced this issue Mar 23, 2024
* Use `alterReceivedProtobuf()` for NeighborInfo and Traceroute
`alterReceived()` should never return NULL
Traceroute should be promiscuous

* Remove extensive logging from NeighborInfo module
mverch67 pushed a commit that referenced this issue Mar 24, 2024
* Use `alterReceivedProtobuf()` for NeighborInfo and Traceroute
`alterReceived()` should never return NULL
Traceroute should be promiscuous

* Remove extensive logging from NeighborInfo module
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.