Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new ROUTER_LATE role #5528

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

erayd
Copy link

@erayd erayd commented Dec 8, 2024

This PR adds a new ROUTER_LATE role.

This role will always rebroadcast packets, but will do so after all other modes. Intended for router nodes that are there to provide additional coverage in areas not already covered by other routers, or to bridge around problematic terrain, but should not be given priority over other routers in order to avoid unnecessaraily consuming hops (basically the "we need this thing to always rebroadcast, but it isn't a 'real' high site" scenario).

By default, this role will rebroadcast during the normal client window. However, if another node is overheard rebroadcasting the packet, then it will be moved to a second window after the normal client one, with the same timing behaviour.

Relevant discussion with @GUVWAF on Discord is here. This feature originally started life as just a configurable delay, and following that discussion has morphed into this new role 🙂.

Please note that this PR requires that these protobuf changes are applied first. I have omitted the protobuf regen commit from this PR, as I'm unsure whether you want it bundled or would prefer to apply those changes separately beforehand.

To assist with testing, a directly buildable version of this based on v2.5.15 with the protobuf regen commit included can be found here.

@erayd
Copy link
Author

erayd commented Dec 8, 2024

@thebentern Did you intend to merge 33494c3 into this PR, or was that a mistake? It doesn't seem related at all...

@fifieldt
Copy link
Contributor

fifieldt commented Dec 8, 2024

Hi @erayd , don't panic! This is just us keeping your development branch in-sync with upstream master so it can merge more easily. If you have a look at the Files tab, you'll see nothing about your patch has changed.

@erayd
Copy link
Author

erayd commented Dec 8, 2024

@fifieldt Thanks for the clarification, and that makes sense 🙂. It was just unexpected, so I was wondering if they had accidentally merged into the wrong branch.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 9, 2024

Looking very good already!

However, I had it slightly differently in my mind. To make sure you are not still sending faster than busy routers/clients, I think we really need to restart the timer with the original delay. For example, when receiving a packet with a long airtime in between (or even when you had already multiple packets to be transmitted in the queue), you might already pass the remaining transmit delay when the packet is in front of the queue, and then you start transmitting directly after the delay from startTransmitTimer(). This is what another router or client then also does, and it depends on the random number drawn who will be faster.

I think it makes sense to change startTransmitTimer() here:

startTransmitTimer();

and here:
startTransmitTimer();

to setTransmitDelay(), for all roles. This is actually more correct, since it will determine the delay based on SNR (if available), instead of channel utilization. It won’t affect current firmware much, because the delay range is still the same.

Then, with this, you can just check whether the packet in front of the queue has the Delayed tag set, and then apply the additional delay, without needing the tx_after field.

The downside is that sometimes indeed you might start it over and that creates additional latency, but I think that's justified, because this usually only happens when the mesh is busy.

@erayd
Copy link
Author

erayd commented Dec 9, 2024

@GUVWAF

Then, with this, you can just check whether the packet in front of the queue has the Delayed tag set, and then apply the additional delay, without needing the tx_after field.

The downside is that sometimes indeed you might start it over and that creates additional latency, but I think that's justified, because this usually only happens when the mesh is busy.

The problem with doing this is when multiple of these are involved that can overhear each other, the latency added can be quite considerable. The problem then becomes DoS via excessive delay.

Would resuming with max(0, remaining tx_after) + regular SNR-based delay be an acceptable compromise? That way your desired keep-adding-slight-delay behaviour is there, but the extra portion of the delay is only added once.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 9, 2024

Would resuming with max(0, remaining tx_after) + regular SNR-based delay be an acceptable compromise?

Sounds good, considering this is an edge case for not so busy meshes anyway. Although a DoS is a big word, because it's already an additional service compared to a normal client.

So if you're using the new tx_after field, I don't think the Delayed tag is necessary anymore?

@erayd
Copy link
Author

erayd commented Dec 9, 2024

Sounds good, considering this is an edge case for not so busy meshes anyway.

Sweet, will implement accordingly.

Although a DoS is a big word, because it's already an additional service compared to a normal client.

DoS seems applicable here, because these things fill an essential role in routing around terrain. They need to not get delayed so long as to be rendered useless (e.g. to cause remote admin packets to time out etc).

The reason they are in the late window isn't because they are unimportant, but because they have rather poor coverage characteristics, and should give better placed sites a shot at rebroadcasting first.

So if you're using the new tx_after field, I don't think the Delayed tag is necessary anymore?

I'm using both, for different purposes. tx_after to keep track of the delay amount, and delayed to ensure that the extra delay only gets added once (instead of once per dupe heard).

@GUVWAF
Copy link
Member

GUVWAF commented Dec 9, 2024

They need to not get delayed so long as to be rendered useless (e.g. to cause remote admin packets to time out etc).

True, but this risk is still there, because it only gets the minimum priority after adding the delay, so if someone is sending multiple packets after each other, the newer packet gets inserted before one that already is being delayed. (So perhaps a re-ordering is also needed after setting the priority to minimum?)

delayed to ensure that the extra delay only gets added once (instead of once per dupe heard).

Since the tx_after is only really needed if you add this additional delay, it could be applied only there and a non-zero value for tx_after would indicate it's already being delayed.

@erayd
Copy link
Author

erayd commented Dec 9, 2024

True, but this risk is still there, because it only gets the minimum priority after adding the delay, so if someone is sending multiple packets after each other, the newer packet gets inserted before one that already is being delayed. (So perhaps a re-ordering is also needed after setting the priority to minimum?)

Good point. Yes.

Since the tx_after is only really needed if you add this additional delay, it could be applied only there and a non-zero value for tx_after would indicate it's already being delayed.

That works 🙂

@erayd erayd changed the title Add new ROUTER_INFILL role Add new ROUTER_LATE role Dec 9, 2024
@erayd
Copy link
Author

erayd commented Dec 10, 2024

@GUVWAF Have made a bunch of changes to the timing & the queuing following our discussion above. Is this now more in line with what you were expecting?

The queuing changes are to ensure that late packets go to the back of the queue, but can still have their priority respected within that domain.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 10, 2024

The current approach will not actually use the additional delay after transmitting or receiving another packet. It only applies this when the channel is active, but it’s likely that this will result in an Rx interrupt, which will overwrite the delay, and call setTransmitDelay() afterwards.

What if you just always use setTransmitDelay()? When receiving a duplicate packet, if tx_after is not yet set, set it only to millis() + getTxDelayMsecWeightedWorst(p->rx_snr). Then in setTransmitDelay(), set the notification for max(tx_after + getTxDelayMsecWeighted(p->rx_snr), millis() + getTxDelayMsecWeighted(p->rx_snr)) - millis(). Checking for tx_after > millis() within TX_DELAY_COMPLETED is then also not necessary.

The reason I am so picky about this is that if we want this to be a general available role, we should really make sure it does not actually transmit faster than routers/clients, and it shouldn’t increase the chance of collisions too much (so also ensure it’s really outside of the window for regular routers/clients).

@erayd
Copy link
Author

erayd commented Dec 11, 2024

The current approach will not actually use the additional delay after transmitting or receiving another packet. It only applies this when the channel is active, but it’s likely that this will result in an Rx interrupt, which will overwrite the delay, and call setTransmitDelay() afterwards.

Thanks - good catch 🙂

What if you just always use setTransmitDelay()? When receiving a duplicate packet, if tx_after is not yet set, set it only to millis() + getTxDelayMsecWeightedWorst(p->rx_snr). Then in setTransmitDelay(), set the notification for max(tx_after + getTxDelayMsecWeighted(p->rx_snr), millis() + getTxDelayMsecWeighted(p->rx_snr)) - millis().

I like this approach. Yes.

Checking for tx_after > millis() within TX_DELAY_COMPLETED is then also not necessary.

This still seems to be necessary. During testing, TX_DELAY_COMPLETED often seems to fire when the requested delay has not yet expired (maybe due to interrupts?), so i think the check is important unless we can find and prevent whatever is causing it to return early.

The reason I am so picky about this is that if we want this to be a general available role, we should really make sure it does not actually transmit faster than routers/clients, and it shouldn’t increase the chance of collisions too much (so also ensure it’s really outside of the window for regular routers/clients).

Picky is good (and I agree with your reasoning). It's important that this be robust 🙂

@erayd
Copy link
Author

erayd commented Dec 11, 2024

If you're wondering where the commit that implements the above changes is... it's coming, but there is an odd bug I need to fix first. Will push it once I have nailed this down.

The Bug:
For some reason, tx_after is arriving at setTransmitDelay with an initial value of precisely 52717827, for every packet, even ones that were not initially delayed, and regardless of whether I explicitly initialise that field to zero earlier (and even though the entire packet object is allocated zero-filled anyway). I currently have no idea where this value is originating.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 11, 2024

Hmm, I'm assuming you applied all the protobuf changes on top of the latest release, right? If the tx_after would be at the same place where e.g. now pki_encrypted is and the app sets this, this might cause incompatibility.

I don't see how this could happen otherwise.

@erayd
Copy link
Author

erayd commented Dec 11, 2024

Hmm, I'm assuming you applied all the protobuf changes on top of the latest release, right? If the tx_after would be at the same place where e.g. now pki_encrypted is and the app sets this, this might cause incompatibility.

Pretty sure I have. It compiles properly with code that refers to those fields, which IMO is an indication that the changes are there... I'd expect to see build errors otherwise. With that said, it seems like a good place for me to look. I might try a full clean & rebuild, and see what that does - maybe there's a bit of the old version lurking around somewhere that shouldn't be.

I don't see how this could happen otherwise.

Yeah, it's confusing as heck. The data is clearly originating somewhere, and given it's not getting there via assignment to tx_after, that does point to something else walking all over its memory space. I'll pursue your suggestion above about maybe something missing the protobuf update - as you say, that does seem like the most likely mechanism.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 15, 2024

Have you been able to figure it out? At least with the current state of your router-infill branch, I cannot reproduce it.
I did see a crash, however, because setTransmitDelay() should now also check whether getFront() doesn't return NULL.

@erayd
Copy link
Author

erayd commented Dec 15, 2024

Have you been able to figure it out? At least with the current state of your router-infill branch, I cannot reproduce it.

I've been extremely short on time the last few days, so haven't yet had the opportunity to dig into it. Hoping to do so either this evening or tomorrow.

I did see a crash, however, because setTransmitDelay() should now also check whether getFront() doesn't return NULL.

Thanks - will add a check for this 🙂

@Kealper
Copy link
Contributor

Kealper commented Dec 17, 2024

Definitely interested in this new role, it would help out in the region I'm in that's a mix of mountains and hills. Hills are the easy part but so far there's been some issues trying to get packets to reliably get around mountains that aren't able to have regular Router nodes on them.

@erayd
Copy link
Author

erayd commented Dec 21, 2024

@GUVWAF What do you reckon now?

The memory stomp appears resolved following full protobuf rebuild & adding that null check after getFront().

@GUVWAF
Copy link
Member

GUVWAF commented Dec 21, 2024

Good to hear! It looks good to me now, thanks for your cooperation.

Let's wait for more approvals on the protobuf definitions.

@erayd
Copy link
Author

erayd commented Dec 21, 2024

Sounds good. Thanks very much for your input and assistance with this feature 😁

@erayd erayd marked this pull request as ready for review December 21, 2024 09:00
@erayd erayd marked this pull request as draft December 21, 2024 09:02
@erayd erayd marked this pull request as ready for review December 21, 2024 09:03
@erayd erayd force-pushed the router-infill-master branch from 8de1024 to bbc2a0e Compare December 21, 2024 10:55
@erayd
Copy link
Author

erayd commented Dec 21, 2024

Have squashed the commits to make it easier to wrangle.

@erayd erayd requested a review from GUVWAF December 21, 2024 21:43
@fifieldt
Copy link
Contributor

Protobufs appear merged

@fifieldt fifieldt force-pushed the router-infill-master branch from bbc2a0e to 7dc330d Compare December 24, 2024 00:45
@erayd
Copy link
Author

erayd commented Dec 24, 2024

@fifieldt

Protobufs appear merged

They're merged in the protobuf repo master branch. Not yet in the firmware repo though.

Do you want the protobuf update in this repo to happen separately, or should I add a commit for that here (bumping the submodule commit & regen update) as part of this PR?

@fifieldt
Copy link
Contributor

Ah! let me trigger that job, moment.

@fifieldt
Copy link
Contributor

( #5658 )

@fifieldt fifieldt force-pushed the router-infill-master branch from 7dc330d to f92c5e4 Compare December 26, 2024 02:43
Will always rebroadcast packets, but will do so after all other modes.
Intended for router nodes that are there to provide additional coverage
in areas not already covered by other routers, or to bridge around
problematic terrain, but should not be given priority over other routers
in order to avoid unnecessaraily consuming hops.

By default, this role will rebroadcast during the normal client window.
However, if another node is overheard rebroadcasting the packet, then it
will be moved to a second window *after* the normal client one, with the
same timing behaviour.
@fifieldt fifieldt force-pushed the router-infill-master branch from f92c5e4 to 1a7adf9 Compare December 26, 2024 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants