[Bug]: Critical fault #12 (flash filesystem corruption) on Heltec T114 v2 #5839
Comments
Do you use an Android phone to connect to your Heltec? There are a few other bugs that look similar to this one. |
Yes, I use an Android phone. |
Same error on T-Echo. Logs attached. Using Android with the 2.5.16 beta app. `ERROR | ??:??:?? 4 Error: can't encode protobuf io error` |
T-Echo: `DEBUG | 15:53:01 2493 [Router] Opening /prefs/db.proto, fullAtomic=0` |
I think I have the same problem (T114 v2). My device is in a boot loop and I can't get it to start at all. `DEBUG | ??:??:?? 5 Expand short PSK #1` |
From the initial report:
And from trying to reproduce the issue last night (from a RAK WisBlock 4631):
Both have something Bluetooth-related happening around the same time as an update to the db.proto file. The Bluetooth library also writes to the file system: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/master/libraries/Bluefruit52Lib/src/utility/bonding.cpp

Could there be some preemption happening that causes an LFS change to happen at the same time as another LFS change is already in progress?

Edit: Another question to investigate: for each bad block detected, does the usable size of the LFS filesystem decrease by a block? Meaning, after a while of this same thing happening, could it become impossible to write new files because the file system no longer has enough free blocks?

Edit 2: Seems like this would prevent simultaneous access from happening: adafruit/Adafruit_nRF52_Arduino#397

Edit 3: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/Bluefruit52Lib/src/bluefruit.cpp#L711 (the flash operations callback inside the bluefruit library?)

Edit 4: Curious why this semaphore allows up to 10 concurrent accesses instead of just 1? https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L110C12-L110C36 There is some good analysis in adafruit/Adafruit_nRF52_Arduino#350 and littlefs-project/littlefs#352; this seems to be what led to the current locking solution in Adafruit_LittleFS. |
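(For readers less familiar with the distinction being questioned in Edit 4, here is a minimal sketch of the two semaphore behaviors, assuming FreeRTOS primitives, which the Adafruit nRF52 core is built on. The names are hypothetical, not the actual identifiers in flash_nrf5x.c.)

```cpp
// Sketch only: hypothetical names, not the actual flash_nrf5x.c code.
#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t flash_sem;

void flash_sem_init(void)
{
    // A counting semaphore created with max/initial count 10 lets up to ten
    // callers pass xSemaphoreTake() before anyone blocks:
    flash_sem = xSemaphoreCreateCounting(10, 10);

    // A binary semaphore (given once up front) admits only one caller at a
    // time, which is what you would normally expect around a single flash peripheral:
    // flash_sem = xSemaphoreCreateBinary();
    // xSemaphoreGive(flash_sem);
}

void flash_access(void)
{
    xSemaphoreTake(flash_sem, portMAX_DELAY);  // enter the guarded section
    // ... perform the flash operation ...
    xSemaphoreGive(flash_sem);                 // release for the next caller
}
```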
I've been able to get some debug logs by modifying this code: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/cores/nRF5/common_func.h#L154-L158

And adding:

```c
#if __cplusplus
extern "C" void logLegacy(const char *level, const char *fmt, ...);
#define PRINTF(...) logLegacy("DEBUG", __VA_ARGS__)
#else
void logLegacy(const char *level, const char *fmt, ...);
#define PRINTF(...) logLegacy("DEBUG", __VA_ARGS__)
#endif
```

I'm not easily able to reproduce the issue though. |
There are definitely some file accesses that happen when the phone connects.
|
@esev It seems like you've got a lot more knowledge than me here, and there are some good clues in the thread so far. I've been testing around this issue today, and as far as I can tell, it occurs when Bluetooth disconnects unexpectedly during a flash write. The reason for the disconnection seems relevant. A graceful disconnect, initiated by the phone, gives reason:
If this type of disconnection occurs during a flash write, there is no issue. Disconnection types that I've seen cause issues are:
I see

One workaround for the

```cpp
void NRF52Bluetooth::shutdown()
{
    // Shutdown bluetooth for minimum power draw
    LOG_INFO("Disable NRF52 bluetooth");
    uint8_t connection_num = Bluefruit.connected();
    if (connection_num) {
        for (uint8_t i = 0; i < connection_num; i++) {
            LOG_INFO("NRF52 bluetooth disconnecting handle %d", i);
            Bluefruit.disconnect(i);
        }
        // Wait for disconnection
        while (Bluefruit.connected())
            yield();
        LOG_INFO("All bluetooth connections ended");
    }
    Bluefruit.Advertising.stop();
}
```

I haven't found any workaround for
It may depend on the phone used for testing, but here is how I have been able to reproduce the issue:
|
Looks like we are likely going to need to make some modifications to Bluefruit's persistence. I would prefer it to pull more of those things into memory to cut down on file system IO if possible. That's a shoot from the hip answer though, until I dig in more.
@todd-herbert I think that workaround is worth a PR even if it's not a comprehensive fix. |
With the changes described in #5839 (comment), I've grabbed some more detailed logs of what's going on in the
|
That's very helpful, @todd-herbert! Knowing it happens when the phone goes out of range is a useful clue. Nothing in the bluefruit library accesses LFS at that point, so we can potentially rule out "multiple fs access" as the cause. This may be odd timeout behavior in the hardware. I'll look into this more after work tonight.
The microwave idea and the save on button press should make debugging much easier. |
Looking at the (very helpful) reference links you've collected, I think I might have a working solution, although possibly I'm just being naive... I'm not sure what the full implications of it would be. It's late here now, but I'll try to get a proof of concept pushed tonight so you can see the general idea and check if there's any merit in it. |
Here are the changes I threw together that appear to fix the issue: flash_nrf5x.c
I seem to now be able to trigger the problematic disconnect with no consequences. Not shown here, but with the additional logging setup you described earlier I was able to add extra debug output to confirm that, in the situation shown above, the flash operations do fail, and are successfully reattempted.
That is the big question... I've replaced it with a binary semaphore with no immediately apparent consequences, but it's a bit over my head, so I'm hoping someone more knowledgeable than me will know if there was some very good reason for that counting semaphore. |
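(To make the retry idea concrete, here is a rough sketch, not the actual branch. It assumes the SoftDevice's asynchronous flash API, where sd_flash_write() queues the operation and the result arrives later as an NRF_EVT_FLASH_OPERATION_* SOC event, plus FreeRTOS for the binary semaphore. The function names, the completion flag, and how the SOC event reaches this code are all hypothetical.)

```cpp
#include <stdint.h>
#include "FreeRTOS.h"
#include "semphr.h"
#include "nrf_soc.h"    // sd_flash_write(), NRF_EVT_FLASH_OPERATION_*
#include "nrf_error.h"  // NRF_SUCCESS

#define FLASH_OP_MAX_RETRIES 3

static SemaphoreHandle_t flash_done_sem;       // binary: given when an operation completes
static volatile bool flash_op_failed = false;

void flash_retry_init(void)
{
    // A binary semaphore starts "empty", so the first take blocks until the
    // completion event gives it.
    flash_done_sem = xSemaphoreCreateBinary();
}

// Hypothetical hook: called from wherever SOC events are dispatched, with the
// result of the most recent flash operation.
void flash_soc_event(uint32_t evt)
{
    if (evt == NRF_EVT_FLASH_OPERATION_SUCCESS || evt == NRF_EVT_FLASH_OPERATION_ERROR) {
        flash_op_failed = (evt == NRF_EVT_FLASH_OPERATION_ERROR);
        xSemaphoreGive(flash_done_sem);
    }
}

// Write n_words 32-bit words, reattempting if the SoftDevice reports a failure
// (for example when the operation is disturbed by BLE activity or a disconnect).
bool flash_write_with_retry(uint32_t *dst, const uint32_t *src, uint32_t n_words)
{
    for (int attempt = 0; attempt < FLASH_OP_MAX_RETRIES; attempt++) {
        if (sd_flash_write(dst, src, n_words) != NRF_SUCCESS)
            continue;                                   // request rejected (e.g. busy); try again
        xSemaphoreTake(flash_done_sem, portMAX_DELAY);  // wait for the completion event
        if (!flash_op_failed)
            return true;                                // operation completed successfully
        // Operation was accepted but later reported as failed; loop and reattempt.
    }
    return false;  // give up after the retry budget is exhausted
}
```

In the real branch the retry presumably wraps the existing write/erase helpers inside flash_nrf5x.c rather than adding a new function, but the mechanics should look broadly similar.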
Yes! That is the exact same thing (retries) I was going to try. Nice work! I'm so happy to see the issue go away with that change! :) I'll dig-in more to the semaphore. I think your change addresses the root issue. I'll give it a try tonight. |
Finally got a chance to test. This is working perfectly on my device. I can trigger the issue and I see the retry happen and succeed. No more lfs messages! Prior to testing your change, I also added some logging to show how many simultaneous callers passed through the original semaphore. It was only ever one at a time. That matches what I was expecting too, as I think the device can only perform one flash operation at a time and reports NRF_ERROR_BUSY when one is in progress. I like your change to use a binary semaphore. It seems much less surprising to me. |
One thing. I'm not sure it is necessary to https://github.com/littlefs-project/littlefs/blob/v1.7.2/DESIGN.md |
Very cool! I don't think I would have noticed that it was related to Bluetooth :) |
For keeping track of debugging ideas for the future:
|
It is reassuring to know that you had also independently reached the same conclusion about where we should go with this for a fix!
Ahh that would make sense if
That's a very good point. I'd restructured a bit today to cut back on the amount of duplicate code, and in the process had gotten rid of the assert, out of a general fear of unintentionally disrupting some fs behavior. Hopefully there's nothing similar lurking in today's changes too. I've opened it up as draft at meshtastic/Adafruit_nRF52_Arduino#1, and would very much appreciate your feedback if you spot anything we could do better. |
I just got another LFS_ASSERT crash. Not boot looping this time though. I tried using https://github.com/todd-herbert/meshtastic-nrf52-arduino/tree/reattempt-flash-ops to see if it would bring it back up, but it did not fix it. The only way I have been able to resuscitate it is to add FSCom.format() into the Arduino setup function. Then, once it has been reformatted on first boot, re-flash with the format line removed. |
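(For anyone else stuck in the same state, here is a sketch of the recovery step described above. It assumes FSCom is the firmware's alias for the internal LittleFS instance, as referenced in the comment, and that FSCommon.h is the right header to pull it from; adjust to wherever setup() lives in your build. Flash this once, then re-flash with the format() line removed.)

```cpp
#include "FSCommon.h"  // assumption: provides the FSCom alias for the internal LittleFS on nRF52

void setup()
{
    FSCom.begin();   // mount (or attempt to mount) the internal filesystem
    FSCom.format();  // one-time reformat to recover from the corrupted filesystem
    // ... continue with the normal setup ...
}

void loop() {}
```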
When I was inducing that 0x22 BLE disconnect failure on shutdown, I would see it eventually end up in that same situation, after I had induced several shutdown-corruptions and the blocks had remapped several times. Maybe there's a clue there? Did you give the device a full erase recently, or has it had a history of these corruption events already? If it does lock up again, one idea might be to enable the extra logging as described by #5839 (comment), to see if there's any interesting info. I imagine you could do this even after it locks up. |
Yes, this device has had a history of this happening. This is the third time I've had to deal with it. The first time it was reboot looping because I think the nodedb got too big, but I also left it in the house and might have had the BT distance thing happen. My attempts at a full erase didn't fix the issue, but I did try the full erase during the first and second occurrences. |
Different device, but adding more diagnostic data: this just happened to my Nano G2 a second time (previously a month ago). Clearing the nodeDB fixed it, on a hunch, after a few days of "nodeDB-full" warnings on the very populated Boston mesh. Android Meshtastic showed critical fault 12, but this time I caught it before the boot loop. An early symptom was the device forgetting the name I set, about two days before the error display and the aggressive reconnect loop. After fault 12 is displayed and the reconnection loop starts, when I connect over USB to clear the nodeDB, the region is also UNSET. Clearing the nodeDB un-wedges it. It almost feels like swap starvation or something, or a broken flash-controller load-balancing algorithm, like the counterfeit SD cards that report 256 GB but really just overwrite preexisting blocks once you go over 2 GB. Interestingly, the random PIN was also the same PIN over and over for every reconnection attempt until I reset the nodeDB |
Edit: This is described much better in #4447
|
I do remember seeing discussion about that over at #4447, so I think you're probably right on the money again. |
Ha! Thank you for pointing me to that bug. Yes, that describes what I was seeing exactly. Including the likely reason it was implemented this way :) |
That would track with at least my issue. The thing that made mine work was
If it's clobbering 3968 bytes of neighboring files for every erased 128 B block, it makes sense that rewriting the files would fix it. (Is there any chance this issue leaves garbage file fragments on the filesystem that aren't being prompted to be erased? I'm not familiar with LittleFS.)
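(Side note on that 3968 figure, as an assumption on my part: if the internal flash erases in 4096-byte pages while the littlefs block size is 128 bytes, then 4096 - 128 = 3968 bytes of neighboring data share each erased page, which would match the collateral damage described above.)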
Just to clarify: I don't believe this is happening for every write. There are two scenarios where it seems likely, though:
I had originally thought LittleFS might be losing blocks, and slowly decreasing its capacity over time when this happens. However, after digging a little deeper into LittleFS, I notice it doesn't keep a free or corrupt block list. I think that leads to two outcomes.
|
Category
Other
Hardware
Heltec Mesh Node T114
Firmware Version
2.5.13.1a06f88
Description
Back in November 2024, my T114 showed a Critical fault #12. I rebooted it and it seemed to work OK, but a few days later, it got into a boot loop. The serial debug output was:
Then about a week later, another of my T114s also got into a boot loop. This one doesn't have a screen, so I don't know if it also had a Critical fault #12. I erased the flash, installed 2.5.13.1a06f88, connected the USB port to a PC, and logged all of the serial output, so in case it happened again, I could see what happened right before the first crash and reboot. After about a month, it rebooted:
And after the reboot, this is what it logged before rebooting again:
So it seems that what triggered it was the "Bad block" errors, and maybe the "relocation" code is buggy and corrupts the filesystem? In any case, the "Bad block" errors seem more relevant. From what I see in lfs.c, it looks like the "Bad block" message means a flash routine returned `LFS_ERR_CORRUPT`. However, I didn't see anything in InternalFileSystem.cpp that would return `LFS_ERR_CORRUPT`. So I think that means `lfs_cache_cmp()` returned `false` (e.g., line 194 of lfs.c).

I haven't looked into the details of how the caching works, but since I don't think the flash is going bad on either of my T114s (V2 hasn't been out that long), I wonder if something else in the firmware is corrupting the memory buffer used for the cache.
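(To illustrate the failure mode being described, here is a self-contained sketch, not the actual lfs.c code, of the program-then-verify pattern: littlefs writes a block, reads it back, and compares; a mismatch is reported as LFS_ERR_CORRUPT, which is what produces the "Bad block" warning and the relocation. If something else corrupts the RAM buffer used for the comparison, the block would be declared bad even though the flash itself is fine. All names and the in-memory "flash" below are invented for the example.)

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

constexpr int BLOCK_SIZE = 128;
constexpr int ERR_CORRUPT = -84;  // same spirit as littlefs' LFS_ERR_CORRUPT

static uint8_t fake_flash[16][BLOCK_SIZE];  // simulated flash: 16 blocks

// Program a block, then verify it by reading back and comparing against the
// caller's buffer. A corrupted buffer (or a genuinely failed write) shows up
// as a mismatch here.
static int prog_and_verify(int block, const uint8_t *data, size_t size)
{
    std::memcpy(fake_flash[block], data, size);           // "program" the block
    if (std::memcmp(fake_flash[block], data, size) != 0)  // read back and compare
        return ERR_CORRUPT;  // the caller would log "Bad block" and relocate
    return 0;
}

int main()
{
    uint8_t buf[BLOCK_SIZE];
    std::memset(buf, 0xAA, sizeof buf);
    std::printf("result: %d\n", prog_and_verify(3, buf, sizeof buf));
    return 0;
}
```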
Relevant log output
No response