Mask out random bits when doing queue ordering #4561

Merged: 2 commits into master from mask-random on Aug 26, 2024

Conversation

jp-bennett (Collaborator)

An unintended consequence of randomizing the top bits of packet IDs is that we also use those values to determine packet order in the queue. Simple fix: mask out the random bits when doing that comparison.

@jp-bennett (Collaborator, Author)

Fixes #4560

@gitbisector (Contributor) left a comment

Using parentheses, as in #define ID_COUNTER_MASK (UINT32_MAX >> 22), is safer.

@jp-bennett (Collaborator, Author)

> Using parentheses, as in #define ID_COUNTER_MASK (UINT32_MAX >> 22), is safer.

That's fair.
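
For context, a minimal sketch of what the agreed-upon masking looks like. Only the ID_COUNTER_MASK definition comes from this thread; the comparator name and surrounding code are hypothetical, not the actual MeshPacketQueue implementation.

```cpp
#include <cstdint>

// The low bits of a packet ID hold the monotonically incrementing counter;
// the high bits are now randomized, so they must be excluded when ordering.
#define ID_COUNTER_MASK (UINT32_MAX >> 22)

// Hypothetical tie-breaker: compare only the counter portion of two IDs.
static bool olderPacketId(uint32_t id1, uint32_t id2)
{
    return (id1 & ID_COUNTER_MASK) < (id2 & ID_COUNTER_MASK);
}
```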

@gitbisector (Contributor)

The other thing I wonder about is packet IDs coming from other nodes. Are they mixed with locally generated ones? That's not specific to this MSB randomization, but it may still cause trouble. Is there no access to the RX timestamp here? I saw code where locally generated packets get a timestamp put on them.

@thebentern merged commit b9a8683 into master on Aug 26, 2024
103 checks passed
@thebentern deleted the mask-random branch on August 26, 2024 at 20:48
@gitbisector (Contributor)

@thebentern sorry to ping, but while this is still fresh: the 'rx_time' field has your name on it. As pkt->id is generated on each transmitting node, using it to sort when priorities are equal is not fair. I propose using rx_time instead. Locally created packets get rx_time assigned too, in router.cpp:allocForSending():

```cpp
return (p1p != p2p) ? (p1p < p2p)                   // prefer bigger priorities
                    : (p1->rx_time >= p2->rx_time); // prefer earlier packets
```

@GUVWAF (Member) commented Aug 27, 2024

I'm not sure if that's a good idea. Packets you receive for rebroadcasting might have been generated much earlier than when you received them. If you wait too long before rebroadcasting, another node that is not in an advantageous position might jump in.

Currently it's not perfect either, because it will be random, but I'm not sure whether there is a better solution. However, having multiple packets with the same priority in the Tx queue is already not a common situation, unless your mesh is very busy.

@gitbisector (Contributor) commented Aug 27, 2024

If only it were truly random. The trouble with the current approach is that in a busy mesh, unlucky packets with a 'bad' ID can sit in a node almost forever, as the tx_queue barely ever goes empty. When they finally come out, some packets might even get seen as 'new' rather than duplicate, because they have been forgotten. While not perfect (we don't have complete information), using rx_time does improve on this.

@GUVWAF (Member) commented Aug 27, 2024

@gitbisector If we want to optimize this even further, I still think packets not generated locally should have priority for the following reasons:

  1. Packets received on the mesh were generated at least the airtime of the packet before the rx_time, and possibly longer if they went through multiple hops already.
  2. The delay a node has before rebroadcasting actually determines who rebroadcasts, because others will cancel it afterwards. If you give priority to your own packets, another node might jump in, even though it has a high SNR and is thus not likely to contribute to routing.
  3. Client applications are notified with the current queue status. If it's full, it's a good indication to slow down your packet generation by doing queuing in the application. Packets to be relayed, on the other hand, will simply be dropped without notification to the original transmitter.
  4. It decreases the chance of judging a received packet as new instead of duplicate.

@gitbisector (Contributor)

Agreed, I'm not at all advocating that local packets should get priority. As it stands, sorting within a priority based on IDs generated by different sources causes a bias towards low packet IDs, putting a huge tail latency on packets with large IDs. In busy meshes (at least ours) that makes for extra stale packets, where we even laugh about how stale they are. This applies to router/repeater nodes too, as even they spend more time listening than transmitting, keeping the queue from emptying.

@GUVWAF (Member) commented Aug 27, 2024

> Agreed, I'm not at all advocating that local packets should get priority.

Okay, but then there should be a different strategy than looking at the rx_time only, as a client application or the device itself can fill the Tx queue much faster than the packets are received and transmitted by the LoRa radio.

Problem is also that the rx_time is set by getValidTime(RTCQualityFromNet). If we don't have a proper notion of time, it won't be set.

@gitbisector (Contributor) commented Aug 27, 2024

> Problem is also that the rx_time is set by getValidTime(RTCQualityFromNet). If we don't have a proper notion of time, it won't be set.

Didn't realize time can stand still. For same-priority sorting we do need a node-local incrementing count. It can be a static inside MeshPacketQueue that increments in enqueue(), but we'd need a spot in meshtastic_MeshPacket to put it. (Sorting on all-equal time within a priority might still work, but I'd need to experiment to verify.)
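
A minimal sketch of that idea, under the assumption that the counter can live next to the packet (all names here are hypothetical, not existing Meshtastic code): the queue stamps a node-local sequence number in enqueue() and uses it only to break ties between equal priorities.

```cpp
#include <cstdint>

// Hypothetical wrapper; meshtastic_MeshPacket has no such field today, which
// is exactly the open question above.
struct QueuedPacket {
    uint8_t priority;
    uint32_t seq; // node-local enqueue order
};

class MeshPacketQueue {
    uint32_t nextSeq = 0; // increments once per enqueue()

  public:
    void enqueue(QueuedPacket &p)
    {
        p.seq = nextSeq++;
        // ... insert into the underlying container as before
    }

    // Within equal priorities the earlier enqueue wins, instead of sorting on
    // the (partly random) packet ID.
    static bool sendsLater(const QueuedPacket &a, const QueuedPacket &b)
    {
        return (a.priority != b.priority) ? (a.priority < b.priority)
                                          : (a.seq > b.seq);
    }
};
```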

@GUVWAF (Member) commented Aug 27, 2024

> How we sort items does not change that.

True, but even if it doesn't occupy the queue fully, it does influence the order in which packets are sent out, with the consequences I laid out earlier. The input from the device or client itself is asynchronous to the LoRa radio, and if you transmit new packets every time, you even delay others from rebroadcasting.

> For mesh scaling purposes the app/device should get a small part.

I'm not sure all client apps currently handle the case where the Tx queue is full properly; I would not be confident giving them a smaller part.

@gitbisector (Contributor)

> Okay, but then there should be a different strategy than looking at the rx_time only, as a client application or the device itself can fill the Tx queue much faster than the packets are received and transmitted by the LoRa radio.

True, the client/device can easily occupy all (16) queue slots. How we sort items does not change that. Limiting sources to only a fraction of the maximum queue space could help with that. For mesh scaling purposes the app/device should get a small part.
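
A rough sketch of that suggestion (the names and the specific cap are illustrative assumptions, not existing code): reserve part of the Tx queue for relayed traffic by capping how many slots locally originated packets may take.

```cpp
#include <cstddef>

constexpr std::size_t MAX_TX_QUEUE = 16;   // total slots, per the discussion above
constexpr std::size_t MAX_LOCAL_SLOTS = 4; // hypothetical share for local/app packets

// Returns whether a locally generated packet may still be enqueued; relayed
// packets would only be checked against the overall limit.
static bool canEnqueueLocal(std::size_t localInQueue, std::size_t totalInQueue)
{
    return totalInQueue < MAX_TX_QUEUE && localInQueue < MAX_LOCAL_SLOTS;
}
```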

@gitbisector (Contributor) commented Aug 27, 2024

> I'm not sure all client apps currently handle the case where the Tx queue is full properly; I would not be confident giving them a smaller part.

Alternative is to make the queue maxlen bigger and keep the size for client apps the same :)

@GUVWAF (Member) commented Aug 27, 2024

> Alternative is to make it bigger :)

Yeah, but we're already talking about an edge case here. If your mesh is this busy, you'll likely have congestion issues already, which may be the cause of seeing "stale" packets due to high-priority packets coming in between.

Increasing the Tx queue size and adding a node-local counter doesn't come for free.

@gitbisector (Contributor)

> Alternative is to make it bigger :)
>
> Yeah, but we're already talking about an edge case here. If your mesh is this busy, you'll likely have congestion issues already, which may be the cause of seeing "stale" packets due to high-priority packets coming in between.
>
> Increasing the Tx queue size and adding a node-local counter doesn't come for free.

We can keep meshtastic_MeshPacket the same size by keeping the count local to the queue. The TX queue size is very modest.

Meshtastic is growing (and we want it to), so is it really an edge case if we are already seeing it on a daily basis?

@GUVWAF (Member) commented Aug 27, 2024

> We can keep meshtastic_MeshPacket the same size by keeping the count local to the queue. The TX queue size is very modest.

I won't discourage you from implementing a cost-effective solution of course ;)

> Meshtastic is growing (and we want it to), so is it really an edge case if we are already seeing it on a daily basis?

Meshtastic itself is, but there is just a limit to how much traffic a mesh can cope with. Switching to a faster modem preset is likely a better solution overall.
However, are you sure that what you're seeing on a daily basis is caused by this specific issue? Isn't the priority causing the stale packets? Before this PR, the IDs were truly random. Busy meshes also have extra-long delays due to high channel utilization, experience collisions more regularly (causing retransmissions), etc.

@gitbisector (Contributor)

I can further instrument the priority queue to provide histograms. From our back-and-forth here, I am walking away with the sense that using rx_time fixes a real bug where large packet IDs get penalized. Besides that bug, there is also the issue that the device/apps can easily fill up the Tx queue, and figuring out a clean mechanism to better share the bandwidth might help with scaling too.

geeksville pushed a commit to geeksville/Meshtastic-esp32 that referenced this pull request Aug 28, 2024
* Mask out random bits when doing queue ordering

* Parenthesis
@GUVWAF (Member) commented Aug 30, 2024

@gitbisector This might help as well if you were mainly seeing stale text messages: #4592
