-
Notifications
You must be signed in to change notification settings - Fork 964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: filesystem corruption if CPU is reset while writing config files. #4395
Comments
btw: see excellent log and comment from @lewisxhe in #3818 (comment). I think we might need to actually define LFS_ASSERT to set a flag indicating that our current file operation had a low level failure and check for that flag in SafeFile (because the lfs routines are not returning a failure code in this case). In particular this code which is called for both file reading and writing:
|
Okay - after a fair amount of stepping through lfs.c, I've definitely found the cause of this bug and I have some fixes to help with it. Alas, I need to clean it up before checkin. It is a little tricky because want to avoid having to fork the adafruit project (again) - so I'll need to put it in our code instead. Unfortunately we have some visitors until Sunday night so I might not be sending this one in until Monday. |
If a LFS filesystem is corrupted, .exists() can fail when a mere .open() attempt would have succeeded. Therefore better to do the .open() in hopes that we can read the file (in case we need to reformat to fix the FS). (Seen and confirmed in stress testing)
* check for LFS assertion failures during file operations (needs customized lfs_util.h to provide suitable hooks) * Remove fsCheck() because checking filesystem by writing to it is very high risk, it makes likelyhood that we will be able to read the config protobufs quite low. * Update the LFS inside of adafruitnrf52 to 1.7.2 (from their old 1.6.1) to get the following fix: littlefs-project/littlefs@97d8d5e
Fixed problemsThe fixes attached to this issue fix 'most' (but not all - see #4447 for findings/risk estimate/recommentation for 3.0) remaining problems for nrf52 filesystem reliability. Problem 1: rebooting while a file is open for writing means that future writes to the filesystem will fail.With the improved LFS debug output we can see that if we rebooted while a file was open for writing the next LFS write will likely fail (in lfs_deorphan).
Issue #3818 improved the LFS_ASSERT behavior by making it not reboot the board, unfortunately it didn't pass up an error code we could detect at file write time so we can handle this failure cleanly. The existing lfs.c code assumed that LFS_ASSERT would NEVER return (if failing), so it has no logic to return a failure code in that case. The good news is that the filesystem READS would still succeed. The bad news is that the (well intended) test write of meshtastic.txt was guaranteed to encounter this failure. So IMO we should not be writing mesthastic.txt as a test. Instead we should attempt to read out config files (reads are not likely to fail in this case), then attempt to rewrite our config files. If the write of our config fails we reformat the FS (was was done when the 'meshtastic.txt' test fails) and write our (valid) config files. The old behaviour would see the failure on writing mesthastic.txt (before we read our config files) and at that point all hope was lost: because it reformatted - losing the config forever. Problem 2: listDir needs workaround for adafruit littlefs filesystem iteration bugWe have to add the following nast check of file.name to listDir().
Because the adafruit iteration code doesn't properly clear _file/_dir in the case of errors returned by little-fs. This makes the check of f.isOpened() falsely return true when really the file does not exist (because we failed while iterating). Problem 3: do not call FSCom.exists in loadProto
The exists method has the same bug as problem 2, if we fail while iterating through any file (even if we would have succeeded in reading from filename), exists returns false. So do not use exists() (so we don't need to fork the adafruit lib to fix it there). Instead just try to open the file and only fail if that operation can not succeed. Problem 4: File renames sometimes caused filesystem corruptionWhile stress testing (rarely - a couple of times a day on a node repeatedly writing/renaming/rebooting/reading files) filesystem corruption was seen that was not caused due to a reboot while writing. As poor luck would have it they fixed it a few days after adafruit forked from the master LFS tree. The version in adafruit was forked on oct 15,2018 adafruit/Adafruit_nRF52_Arduino@f328685. The version in framework-arduinoadafruitnrf52 is: commit cb62bf2188854c5b7c44383571ebb19a414e6137 (tag: v1.6.1) Since it seemed I needed to update LFS, I went ahead and updated it to the 'final' LFS 1.x release (v1.7.2) so that hopefully this update will be more useful to other adafruit users. I'll be sending them a PR but for the time being I'm changing meshtastic to pull from my fresh fork of their project (see attached commits). With this fix the 'rename related' corruption no longer occurs. I updated lfs.c, lfs.h and lfs_util.h by first generating a patch file between the old adafruit tree and v1.6.1 of LFS, then copied in the v1.7.2 version of those three files, then reapplied the patch file so that the (small number of) existing adafruit local changes were preserved.
Rejected theories(These are probably not super interesting, but they were in my notes so I'm keeping them for future searchers/me) Theory: perhaps the problem is that the flash page is still in writing mode upon reboot?!!! Theory: is it possible other flash writes are in the queue so the count of our semaphore gets fucked up? Thoery: does flash_cache.c have a bug Theory: should we throttle nodeDB filesystem writes Theory: icache is disabled if you use debugger to reset CPU while programming? |
* bug #4184: fix config file loss due to filesystem write errors * Use SafeFile for atomic file writing (with xor checksum readback) * Write db.proto last because it could be the largest file on the FS (and less critical) * Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem) * generate a new critial fault if we encounter errors writing to flash either CriticalErrorCode_FLASH_CORRUPTION_RECOVERABLE or CriticalErrorCode_FLASH_CORRUPTION_UNRECOVERABLE (depending on if the second write attempt worked) * reformat the filesystem if we detect it is corrupted (then rewrite our config files) (only on nrf52 - not sure yet if we should bother on ESP32) * If we have to format the FS, make sure to preserve the oem.proto if it exists * add logLegacy() so old C code in libs can log via our logging * move filesList() to a better location (used only in developer builds) * Reformat with "trunk fmt" to match our coding conventions * for #4395: don't use .exists() to before attempting file open If a LFS filesystem is corrupted, .exists() can fail when a mere .open() attempt would have succeeded. Therefore better to do the .open() in hopes that we can read the file (in case we need to reformat to fix the FS). (Seen and confirmed in stress testing) * for #4395 more fixes, see below for details: * check for LFS assertion failures during file operations (needs customized lfs_util.h to provide suitable hooks) * Remove fsCheck() because checking filesystem by writing to it is very high risk, it makes likelyhood that we will be able to read the config protobufs quite low. * Update the LFS inside of adafruitnrf52 to 1.7.2 (from their old 1.6.1) to get the following fix: littlefs-project/littlefs@97d8d5e * use disable_adafruit_usb.py now that we are (temporarily?) using a forked adafruit lib We need to reach inside the adafruit project and turn off USE_TINYUSB, just doing that from platformio.ini is no longer sufficient. Tested on a wio-sdk-wm1110 board (which is the only board that had this problem) --------- Co-authored-by: Ben Meadors <[email protected]>
Category
Hardware Compatibility
Hardware
Rak4631
Firmware Version
master (private build)
Description
Moving power-supply related flash corruption out of #4184 and into this bug.
Relevant comments copied:
btw: the 'good' news is that I can pretty easily reproduce this bug by power cycling a rak4630 (noticed while doing my power testing). I'm still planning on working on this issue once power stuff is done.
oh: interesting: I can cause this corruption by performing a filesystem write 100% of the time on rak4630 by doing a write while the battery voltage is down at 3.2V.
Relevant log output
No response
The text was updated successfully, but these errors were encountered: