DRAFT: Add hardware watchdog for the RP2040 platform #4349

The3rdPlace · 2024-07-29T20:21:28Z

The watchdog must be enabled late in the boot stage because the RP2040's watchdog only allows a delay of about 8 seconds before resetting, and some of the initialization calls after rp2040Setup(), and even after the first call to rp2040Loop(), block execution long enough to exceed this delay.

Due to this architectural problem, we still have a potential freeze on the rp2040 if it hangs during boot and initialization, but practical experience seems to indicate that in most cases, a freeze happens long into the running state (often days), so unless one builds a node to be deployed deep into the wilderness, we accept this risk.

There is no likely workaround for the short delay counter in the hardware watchdog, since the limit comes from the hardware registers.
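
For readers wondering where the roughly 8 second ceiling comes from, here is a small arithmetic sketch. It assumes, as far as I understand the RP2040 datasheet and the Pico SDK, that the watchdog counter is 24 bits wide, ticks once per microsecond, and is decremented twice per tick on the RP2040 (erratum RP2040-E1). The constant and function names are made up for illustration and are not part of this PR or the SDK.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the watchdog down-counter is 24 bits wide.
constexpr std::uint32_t kCounterMax = 0xFFFFFFu;
// Assumption: the counter is clocked with a 1 MHz tick, so 1000 ticks per millisecond.
constexpr std::uint32_t kTicksPerMs = 1000;
// Assumption: on the RP2040 the counter drops by two per tick (erratum RP2040-E1).
constexpr std::uint32_t kDecrementsPerTick = 2;

// Longest timeout, in whole milliseconds, that fits in the counter.
constexpr std::uint32_t maxWatchdogTimeoutMs()
{
    return kCounterMax / (kTicksPerMs * kDecrementsPerTick);
}

// Clamp a requested timeout to what the counter can actually hold.
constexpr std::uint32_t clampWatchdogTimeoutMs(std::uint32_t requestedMs)
{
    return requestedMs > maxWatchdogTimeoutMs() ? maxWatchdogTimeoutMs() : requestedMs;
}

// Counter value that would be loaded for a requested timeout (hypothetical helper).
std::uint32_t watchdogLoadValue(std::uint32_t requestedMs)
{
    const std::uint32_t load = clampWatchdogTimeoutMs(requestedMs) * kTicksPerMs * kDecrementsPerTick;
    assert(load <= kCounterMax); // the clamped value always fits in 24 bits
    return load;
}
```

Under these assumptions, asking for 90 seconds still yields a timeout of a little over 8 seconds, which is why a longer timeout cannot be had from the hardware alone.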

GUVWAF · 2024-07-30T15:27:49Z

The downside of adding a watchdog is that it hides (potentially serious) bugs. Especially since the maximum timeout you can set is relatively short and it requires workarounds like this to make it work, I'm hesitant about it. For nRF52 we don't have a watchdog and for ESP32 it was increased from 45 to 90 seconds.

Have you confirmed it's properly resetting it when it receives a lot of packets back-to-back, reconnects to a client app when you have a lot of nodes in the DB, or reconnects to your Wi-Fi AP (this delay might be problematic), etc.?

The3rdPlace · 2024-08-13T20:04:21Z

Sorry for the late answer, vacation time got me :-D

I agree on the 8 seconds being too little time, my test devices had uptimes between 3 and 14 hours, so clearly a problem.

I have reworked the solution to provide the same 90-second timeout as for the ESP32 device. I will mark this PR as draft until my devices have run for some days (approx. 32 hours and counting in testing with my main node).

GUVWAF · 2024-08-15T06:53:00Z

src/platform/rp2040/rp2040Watchdog.h

+
+            // Update watchdog if timeout is below 90 seconds, same as esp32 watchdog
+            if (timeout < 90 * 1000) {
+                watchdog_update();


This is not going to work, right? In the worst case you only call watchdog_update() just before the 90 seconds expires, but it already triggers after 8 seconds.

timeout needs to be below 90 seconds. In most cases it's close to 0 (zero) (due to soft-updates from the loop or other places), so it will be called approx. every 4 seconds - but I will add a trace line so that we can verify this. np.

I mean I don't see why you're checking for less than 90 seconds? It will never be more than 8 seconds, because then it reboots already.
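
To make the two positions in this exchange concrete, here is a host-side model of a 90 second software timeout layered on an 8 second hardware watchdog. It is only a sketch: every type and function name is hypothetical, the 8000 ms limit is rounded, and time advances in 1 second steps. The periodic service task forwards a feed to the hardware only while the software deadline still holds, which mirrors the `timeout < 90 * 1000` check in the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model only; none of these names exist in the firmware or the Pico SDK.
constexpr std::uint32_t kNoReset = UINT32_MAX;

struct SimulatedHardwareWatchdog {
    std::uint32_t limitMs;    // hardware limit, taken as 8 s here
    std::uint32_t lastFeedMs; // last time the hardware counter was reloaded
    void feed(std::uint32_t nowMs) { lastFeedMs = nowMs; }
    bool expired(std::uint32_t nowMs) const { return nowMs - lastFeedMs >= limitMs; }
};

struct SoftwareWatchdog {
    std::uint32_t softLimitMs;    // software limit, 90 s in this discussion
    std::uint32_t lastSoftFeedMs; // last "soft update" from the main loop
    void softFeed(std::uint32_t nowMs) { lastSoftFeedMs = nowMs; }
    // Periodic service: feed the hardware only while the soft deadline holds.
    void service(SimulatedHardwareWatchdog &hw, std::uint32_t nowMs) const
    {
        if (nowMs - lastSoftFeedMs < softLimitMs)
            hw.feed(nowMs);
    }
};

// Returns the simulated time of the hardware reset, or kNoReset if none occurs.
// The application stops soft-feeding at appStallsAtMs; the service task actually
// runs every servicePeriodMs (which may be longer than what was requested).
std::uint32_t simulateResetTimeMs(std::uint32_t servicePeriodMs, std::uint32_t appStallsAtMs,
                                  std::uint32_t horizonMs)
{
    assert(servicePeriodMs > 0);
    SimulatedHardwareWatchdog hw{8000, 0};
    SoftwareWatchdog sw{90000, 0};
    for (std::uint32_t now = 0; now <= horizonMs; now += 1000) {
        if (hw.expired(now))
            return now;
        if (now < appStallsAtMs)
            sw.softFeed(now);
        if (now % servicePeriodMs == 0)
            sw.service(hw, now);
    }
    return kNoReset;
}
```

In this model, if the service task runs every 4 seconds and the application stalls at 10 seconds, the reset only arrives after the soft deadline plus one hardware period. If the service task itself is held off for 10 seconds, the hardware resets at 8 seconds no matter what the 90 second check says, which is the objection raised here.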

GUVWAF · 2024-08-15T07:03:02Z

Sorry for the late answer, vacation time got me :-D

No problem, we're all doing this for fun :)

my test devices had uptimes between 3 and 14 hours, so clearly a problem.

If you have such short uptimes, it would be relatively easy to get to the root cause of the problem. You can e.g. connect another Pico as a debugger. The call stack when it hangs will likely give us a clue what's going wrong - it might even be something in the Arduino Pico core, like we had in e.g. #2558.

If it's now going to reboot every 3 hours, that's not a good solution either IMO, as you'll lose everything in RAM (e.g. received packet queue to be delivered to a phone), and every time it boots it sends out its NodeInfo to everyone and asks for a request, which leads to a storm of packets if it's in a good position.

GUVWAF · 2024-08-15T07:10:15Z

src/platform/rp2040/rp2040Watchdog.h

+        }
+
+        // Time until this thread runs again
+        return 4 * 1000; // 4 seconds


This is just what you ask it to do. If it doesn't have time to schedule this thread, it won't run this fast.

ArduinoThread (which this utilizes underneath) is not a real-time OS:

It should be noted that these are not “threads” in the real computer-science meaning of the term: tasks are implemented as functions that are run periodically. On the one hand, this means that the only way a task can yield the CPU is by returning to the caller, and it is thus inadvisable to delay() or do long waits inside any task.

Correct, so if some other threads are doing lengthy work inside their run function, this is absolutely a problem and we should find out where this happens (we already know that Wi-Fi init is blocking!). If all other threads behave well and schedule lengthy work to slower loops or queues, this should work, but we can schedule it faster to leave more headroom in case calls are a bit delayed.

I will add some diagnostic output while field testing to get some indication of the precision of the scheduler.

we should find out where this happens

Yes, I agree, but it can be anywhere in the firmware, under any kind of scenario. It's very hard to guarantee that if you fix one case, it won't appear somewhere else, and if it leads to an endless reboot loop, that would be a nasty regression.
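
The scheduling concern in this thread can be illustrated with a toy cooperative scheduler. This is a host-side sketch, not ArduinoThread itself: the `Task` struct and `longestFeedGapMs` are hypothetical names, tasks are visited round-robin, and a task "blocks" simply by advancing the simulated clock. It measures the longest gap between two runs of the task that feeds the watchdog.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a cooperative task: it can only yield by returning.
struct Task {
    std::uint32_t periodMs;  // requested interval between runs
    std::uint32_t runCostMs; // how long the task holds the CPU each time it runs
    bool feedsWatchdog;      // true for the watchdog-feeding task
};

// Runs the tasks round-robin on a simulated clock and returns the longest
// observed gap between two watchdog feeds within horizonMs.
std::uint32_t longestFeedGapMs(const std::vector<Task> &tasks, std::uint32_t horizonMs)
{
    std::vector<std::uint32_t> nextDueMs(tasks.size(), 0);
    std::uint32_t now = 0, lastFeed = 0, longestGap = 0;
    while (now < horizonMs) {
        bool ranSomething = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            assert(tasks[i].periodMs > 0); // a zero period would never let time advance
            if (now < nextDueMs[i])
                continue; // not due yet
            if (tasks[i].feedsWatchdog) {
                longestGap = std::max(longestGap, now - lastFeed);
                lastFeed = now;
            }
            now += tasks[i].runCostMs; // nobody else runs while this task is busy
            nextDueMs[i] = now + tasks[i].periodMs;
            ranSomething = true;
        }
        if (!ranSomething)
            now += 1; // idle for one millisecond
    }
    return longestGap;
}
```

With a feeder requested every 4 seconds and a second task that holds the CPU for 100 ms, the longest gap in this model stays at 4 seconds. Replace the second task with one that blocks for 10 seconds and the gap grows to 10 seconds, past the 8 second hardware limit, even though the feeder still asked to run every 4 seconds.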

The3rdPlace · 2024-08-15T08:40:55Z

Just to make it clear :-D the 3 hour reboots were after having only the 8 seconds timeout, so no need to debug into that ;)

I am field testing the current code now to check what uptimes we get with the current revision. I also need to run some tests to find a baseline. All of this takes some time, so I'll just leave it as a draft until I am a bit wiser on the general stability.

GUVWAF · 2024-08-15T08:48:29Z

Just to make it clear :-D the 3 hour reboots were after having only the 8 seconds timeout, so no need to debug into that ;)

Ah, I see. Likely it wasn't returning to rp2040Loop() often enough.
But still, in your current revision there's only an 8 second timeout. It's fixed in the hardware; I believe there's no way around it.

fifieldt · 2024-10-08T08:21:03Z

Hi @The3rdPlace , just checking in to see how your tests went ...

fifieldt · 2024-10-28T02:59:12Z

Closing this to clean our pull request queue since it's been waiting on progress for a while. Feel free to re-propose any time!

thebentern requested a review from caveman99 July 30, 2024 14:11

The3rdPlace added 3 commits August 12, 2024 11:56

Rework rp2040 watchdog so that we get longer timeouts than 8 seconds

0ef9273

Remove unneeded trace output

c356f78

The3rdPlace force-pushed the add-rp2040-hw-watchdog branch from 3b21102 to c356f78 Compare August 13, 2024 20:00

The3rdPlace marked this pull request as draft August 13, 2024 20:04

The3rdPlace changed the title ~~Add hardware watchdog for the RP2040 platform~~ DRAFT: Add hardware watchdog for the RP2040 platform Aug 13, 2024

Revert modification of vscode project file

c3c4a8c

GUVWAF reviewed Aug 15, 2024

fifieldt closed this Oct 28, 2024
