DRAFT: Add hardware watchdog for the RP2040 platform #4349
The downside of adding a watchdog is that it hides (potentially serious) bugs. Especially since the maximum timeout you can set is relatively short and it requires workarounds like this to make it work, I'm hesitant about it. For nRF52 we don't have a watchdog and for ESP32 it was increased from 45 to 90 seconds. Have you confirmed it's properly resetting it when it receives a lot of packets back-to-back, reconnects to a client app when you have a lot of nodes in the DB, or reconnects to your Wi-Fi AP (this delay might be problematic), etc.?
The watchdog must be enabled late in the boot stage, since the rp2040's watchdog only allows a delay of 8 seconds before resetting, and some of the initialization calls after the call to rp2040Setup(), and even after the first call to rp2040Loop(), block execution long enough to exceed this delay.
Due to this architectural problem, the rp2040 can still freeze if it hangs during boot and initialization. Practical experience seems to indicate that in most cases a freeze happens long into the running state (often after days), so unless one builds a node to be deployed deep in the wilderness, we accept this risk.
There is no likely way around the short delay counter in the hardware watchdog, since the limit is set by the hardware registers.
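To make the 8-second limit concrete, here is a small host-side sketch of the arithmetic. It rests on two assumptions that are not stated in this PR and should be checked against the RP2040 datasheet: the watchdog load counter is 24 bits wide and ticks once per microsecond, and on the RP2040 it decrements twice per tick. The names `kLoadMax`, `kDecrementsPerTick`, `maxWatchdogTimeoutMs` and `wouldTripWatchdog` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 24-bit load counter, one tick per microsecond.
constexpr uint32_t kLoadMax = 0xFFFFFFu;
// Assumption: the RP2040 counter decrements twice per tick,
// which halves the usable range.
constexpr uint32_t kDecrementsPerTick = 2;

// Longest timeout, in whole milliseconds, that fits in the counter.
constexpr uint32_t maxWatchdogTimeoutMs()
{
    return kLoadMax / kDecrementsPerTick / 1000;
}

// True if a call that blocks for blockingMs would trip a watchdog
// armed with the maximum timeout and not fed in the meantime.
constexpr bool wouldTripWatchdog(uint32_t blockingMs)
{
    return blockingMs > maxWatchdogTimeoutMs();
}
```

Under these assumptions the ceiling is a little over 8 seconds, so any initialization step that blocks longer than that has to finish before the watchdog is armed, which is why it is enabled late in boot.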
3b21102 to c356f78
Sorry for the late answer, vacation time got me :-D I agree that 8 seconds is too little time; my test devices had uptimes between 3 and 14 hours, so clearly a problem. I have reworked the solution to provide the same 90-second timeout as for the esp32 device. I will mark this PR as draft until my devices have run for some days (approx. 32 hours and counting in testing with my main node).
// Update watchdog if timeout is below 90 seconds, same as esp32 watchdog
if (timeout < 90 * 1000) {
    watchdog_update();
This is not going to work, right? In the worst case you only call watchdog_update() just before the 90 seconds expire, but it already triggers after 8 seconds.
timeout needs to be below 90 seconds. In most cases it's close to 0 (zero) due to soft updates from the loop or other places, so watchdog_update() will be called approx. every 4 seconds - but I will add a trace line so that we can verify this. np.
I mean I don't see why you're checking for less than 90 seconds? It will never be more than 8 seconds, because then it reboots already.
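The two timeouts in this thread can be modelled on a host machine to see how they interact. The sketch below is a hypothetical model, not the PR's code: `WatchdogModel`, `softFeed`, `periodicTask` and `simulate` are invented names, the clock is simulated in 1-second steps, and on a real device the hardware feed would be `watchdog_update()`. It assumes the reading given in the reply above, where `timeout` is the time since the firmware last reported progress.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of a software timeout layered on a
// short hardware watchdog.
struct WatchdogModel {
    static constexpr uint32_t kHardwareTimeoutMs = 8 * 1000;  // hardware limit
    static constexpr uint32_t kSoftwareTimeoutMs = 90 * 1000; // same as esp32
    uint32_t lastSoftFeedMs = 0; // last time the firmware reported progress
    uint32_t lastHardFeedMs = 0; // last time the hardware counter was reloaded
    bool rebooted = false;

    // Called by firmware code that is still making progress.
    void softFeed(uint32_t nowMs) { lastSoftFeedMs = nowMs; }

    // The hardware resets once it has gone unfed for 8 seconds.
    void checkHardware(uint32_t nowMs)
    {
        if (nowMs - lastHardFeedMs >= kHardwareTimeoutMs)
            rebooted = true;
    }

    // Periodic task: reload the hardware counter only while the
    // software deadline has not passed.
    void periodicTask(uint32_t nowMs)
    {
        checkHardware(nowMs);
        if (rebooted)
            return;
        if (nowMs - lastSoftFeedMs < kSoftwareTimeoutMs)
            lastHardFeedMs = nowMs;
    }
};

// Firmware reports progress every second until hangAtMs; the periodic
// task runs every taskIntervalMs. Returns the reboot time in ms, or -1
// if no reboot happens before endMs.
int64_t simulate(uint32_t hangAtMs, uint32_t taskIntervalMs, uint32_t endMs)
{
    WatchdogModel wd;
    for (uint32_t t = 0; t < endMs; t += 1000) {
        if (t < hangAtMs)
            wd.softFeed(t);
        if (t % taskIntervalMs == 0)
            wd.periodicTask(t);
        else
            wd.checkHardware(t);
        if (wd.rebooted)
            return t;
    }
    return -1;
}
```

In this model, `simulate(10000, 4000, 200000)` reboots only after the 90-second software deadline plus one hardware timeout, while `simulate(1000000, 12000, 200000)` reboots after 8 seconds even though the firmware never stops making progress. The second case is the reviewer's concern: the 90-second check cannot help once the feeding task itself is delayed beyond the hardware limit.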
No problem, we're all doing this for fun :)
If you have such short uptimes, it would be relatively easy to get to the root cause of the problem. You can e.g. connect another Pico as a debugger. The callstack when it hangs will likely give us a clue what's going wrong - it might be even something in the Arduino Pico core, like we had in e.g. #2558. If it's now going to reboot every 3 hours, that's not a good solution either IMO, as you'll lose everything in RAM (e.g. received packet queue to be delivered to a phone), and every time it boots it sends out its NodeInfo to everyone and asks for a request, which leads to a storm of packets if it's in a good position.
}

// Time until this thread runs again
return 4 * 1000; // 4 seconds
This is just what you ask it to do. If it doesn't have time to schedule this thread, it won't run this fast.
ArduinoThread (which this utilizes underneath) is not a real-time OS:
It should be noted that these are not “threads” in the real computer-science meaning of the term: tasks are implemented as functions that are run periodically. On the one hand, this means that the only way a task can yield the CPU is by returning to the caller, and it is thus inadvisable to delay() or do long waits inside any task.
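The effect described in the quote can be reproduced with a small simulation. This is a hypothetical round-robin model on a simulated clock, not the ArduinoThread implementation; `Task` and `runScheduler` are invented names, and the blocking times are made-up inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical cooperative task. It holds the CPU for blocksForMs on
// every run and asks to be run again intervalMs after it returns.
struct Task {
    uint32_t intervalMs;     // requested interval between runs
    uint32_t blocksForMs;    // simulated time spent inside each run
    uint32_t nextRunMs = 0;  // earliest time of the next run
    uint32_t lastRunMs = 0;  // start time of the previous run
    uint32_t worstGapMs = 0; // longest observed gap between two runs
    bool hasRun = false;
};

// Runs the tasks in order on a simulated clock until endMs. A task can
// only start once every task ahead of it has returned.
void runScheduler(std::vector<Task> &tasks, uint32_t endMs)
{
    uint32_t nowMs = 0;
    while (nowMs < endMs) {
        bool ranAny = false;
        for (Task &t : tasks) {
            if (nowMs < t.nextRunMs)
                continue;
            if (t.hasRun && nowMs - t.lastRunMs > t.worstGapMs)
                t.worstGapMs = nowMs - t.lastRunMs;
            t.hasRun = true;
            t.lastRunMs = nowMs;
            nowMs += t.blocksForMs; // the task holds the CPU until it returns
            t.nextRunMs = nowMs + t.intervalMs;
            ranAny = true;
        }
        if (!ranAny)
            nowMs += 1; // idle: nothing was due, advance the clock
    }
}
```

With a feeder task that asks for 4 seconds and a second task that blocks for 10 seconds per run, the feeder's worst gap in this model exceeds the 8-second hardware limit. If the second task returns within 100 ms instead, the feeder keeps to roughly its requested interval.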
Correct, so if some other threads are doing lengthy work inside their run function, this is absolutely a problem and we should find out where this happens (we already know that wifi init is blocking!). If all other threads behave well and schedule lengthy work to slower loops or queues, this should work, but we can schedule it more often to leave more headroom in case calls are a bit delayed.
I will add some diagnostics output while field-testing to get some indication of the precision of the scheduler.
we should find out where this happens
Yes, I agree, but it can be anywhere in the firmware, under any kind of scenario. It's very hard to guarantee that if you fix one case, it won't appear somewhere else and if it leads to an endless reboot loop, that would be a nasty regression.
Just to make it clear :-D the 3-hour reboots were with only the 8-second timeout, so no need to debug into that ;) I am field-testing the current code now to check what uptimes we get with the current revision, then I also need to make some tests to find a baseline - all of this takes some time, so I'll just leave it as a draft until I am a bit wiser on the general stability.
Ah, I see. Likely it wasn't returning to
Hi @The3rdPlace, just checking in to see how your tests went ...
Closing this to clean our pull request queue since it's been waiting on progress for a while. Feel free to re-propose any time!