Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: filesystem corruption if CPU is reset while writing config files. #4395

Closed
geeksville opened this issue Aug 5, 2024 · 3 comments · Fixed by #4406
Closed

[Bug]: filesystem corruption if CPU is reset while writing config files. #4395

geeksville opened this issue Aug 5, 2024 · 3 comments · Fixed by #4406
Assignees
Labels
bug Something isn't working

Comments

@geeksville
Copy link
Member

Category

Hardware Compatibility

Hardware

Rak4631

Firmware Version

master (private build)

Description

Moving power-supply related flash corruption out of #4184 and into this bug.

Relevant comments copied:

btw: the 'good' news is that I can pretty easily reproduce this bug by power cycling a rak4630 (noticed while doing my power testing). I'm still planning on working on this issue once power stuff is done.

oh: interesting: I can cause this corruption by performing a filesystem write 100% of the time on rak4630 by doing a write while the battery voltage is down at 3.2V.

Relevant log output

No response

@geeksville geeksville added the bug Something isn't working label Aug 5, 2024
@geeksville geeksville self-assigned this Aug 5, 2024
@geeksville
Copy link
Member Author

geeksville commented Aug 6, 2024

btw: see excellent log and comment from @lewisxhe in #3818 (comment).

I think we might need to actually define LFS_ASSERT to set a flag indicating that our current file operation had a low level failure and check for that flag in SafeFile (because the lfs routines are not returning a failure code in this case).

In particular this code which is called for both file reading and writing:

static int lfs_ctz_find(lfs_t *lfs,
        lfs_cache_t *rcache, const lfs_cache_t *pcache,
        lfs_block_t head, lfs_size_t size,
        lfs_size_t pos, lfs_block_t *block, lfs_off_t *off) {
    if (size == 0) {
        *block = 0xffffffff;
        *off = 0;
        return 0;
    }

    lfs_off_t current = lfs_ctz_index(lfs, &(lfs_off_t){size-1});
    lfs_off_t target = lfs_ctz_index(lfs, &pos);

    while (current > target) {
        lfs_size_t skip = lfs_min(
                lfs_npw2(current-target+1) - 1,
                lfs_ctz(current));

        int err = lfs_cache_read(lfs, rcache, pcache, head, 4*skip, &head, 4);
        head = lfs_fromle32(head);
        if (err) {
            return err;
        }

        LFS_ASSERT(head >= 2 && head <= lfs->cfg->block_count);
        current -= 1 << skip;
    }

    *block = head;
    *off = pos;
    return 0;
}

@geeksville geeksville changed the title [Bug]: filesystem corruption due to low VCCH on RAK4631 [Bug]: filesystem corruption if CPU is reset while writing config files. Aug 6, 2024
@geeksville
Copy link
Member Author

geeksville commented Aug 8, 2024

Okay - after a fair amount of stepping through lfs.c, I've definitely found the cause of this bug and I have some fixes to help with it. Alas, I need to clean it up before checkin. It is a little tricky because want to avoid having to fork the adafruit project (again) - so I'll need to put it in our code instead. Unfortunately we have some visitors until Sunday night so I might not be sending this one in until Monday.

geeksville added a commit to geeksville/Meshtastic-esp32 that referenced this issue Aug 12, 2024
If a LFS filesystem is corrupted, .exists() can fail when a mere .open()
attempt would have succeeded.  Therefore better to do the .open() in hopes that
we can read the file (in case we need to reformat to fix the FS).
(Seen and confirmed in stress testing)
geeksville added a commit to geeksville/Meshtastic-esp32 that referenced this issue Aug 12, 2024
* check for LFS assertion failures during file operations (needs customized lfs_util.h to provide suitable hooks)
* Remove fsCheck() because checking filesystem by writing to it is very high risk, it makes likelyhood that we will
be able to read the config protobufs quite low.
* Update the LFS inside of adafruitnrf52 to 1.7.2 (from their old 1.6.1) to get the following fix:
littlefs-project/littlefs@97d8d5e
@geeksville
Copy link
Member Author

geeksville commented Aug 12, 2024

Fixed problems

The fixes attached to this issue fix 'most' (but not all - see #4447 for findings/risk estimate/recommentation for 3.0) remaining problems for nrf52 filesystem reliability.

Problem 1: rebooting while a file is open for writing means that future writes to the filesystem will fail.

With the improved LFS debug output we can see that if we rebooted while a file was open for writing the next LFS write will likely fail (in lfs_deorphan).

INFO  | ??:??:?? 5 Saving /prefs/config.proto
ERROR | ??:??:?? 5 LFS assert: pcache->block == 0xffffffffLFS assert: entry.d.type == LFS_TYPE_REGCan't open tmp file for readback
ERROR | ??:??:?? 5 Can't write prefs!
INFO  | ??:??:?? 5 Saving /prefs/module.proto
ERROR | ??:??:?? 5 LFS assert: entry.d.type == LFS_TYPE_REGCan't open tmp file for readback
ERROR | ??:??:?? 5 Can't write prefs!
INFO  | ??:??:?? 6 Saving /prefs/channels.proto
INFO  | ??:??:?? 6 Saving /prefs/db.proto
INFO  | ??:??:?? 7 Clearing bluetooth bonds!

Issue #3818 improved the LFS_ASSERT behavior by making it not reboot the board, unfortunately it didn't pass up an error code we could detect at file write time so we can handle this failure cleanly.

The existing lfs.c code assumed that LFS_ASSERT would NEVER return (if failing), so it has no logic to return a failure code in that case.
This set of changes includes a new lfs_assert_failed global that the new lfs_assert() function calls. In SafeFile.cpp we now check for this flag and consider the write/read to be failed regardless of what the File methods say.

The good news is that the filesystem READS would still succeed.

The bad news is that the (well intended) test write of meshtastic.txt was guaranteed to encounter this failure.

So IMO we should not be writing mesthastic.txt as a test. Instead we should attempt to read out config files (reads are not likely to fail in this case), then attempt to rewrite our config files. If the write of our config fails we reformat the FS (was was done when the 'meshtastic.txt' test fails) and write our (valid) config files.

The old behaviour would see the failure on writing mesthastic.txt (before we read our config files) and at that point all hope was lost: because it reformatted - losing the config forever.

Problem 2: listDir needs workaround for adafruit littlefs filesystem iteration bug

We have to add the following nast check of file.name to listDir().

while (file && file.name()[0]) { // This file.name() check is a workaround for a bug in the Adafruit LittleFS nrf52 glue 

Because the adafruit iteration code doesn't properly clear _file/_dir in the case of errors returned by little-fs. This makes the check of f.isOpened() falsely return true when really the file does not exist (because we failed while iterating).

Problem 3: do not call FSCom.exists in loadProto

    if (!FSCom.exists(filename)) {
        LOG_INFO("File %s not found\n", filename);
        return LoadFileResult::NOT_FOUND;
    }

The exists method has the same bug as problem 2, if we fail while iterating through any file (even if we would have succeeded in reading from filename), exists returns false. So do not use exists() (so we don't need to fork the adafruit lib to fix it there). Instead just try to open the file and only fail if that operation can not succeed.

Problem 4: File renames sometimes caused filesystem corruption

While stress testing (rarely - a couple of times a day on a node repeatedly writing/renaming/rebooting/reading files) filesystem corruption was seen that was not caused due to a reboot while writing.
After some debugging and checking github history, it seems that someone else had already found and fixed this bug:
littlefs-project/littlefs@97d8d5e

As poor luck would have it they fixed it a few days after adafruit forked from the master LFS tree. The version in adafruit was forked on oct 15,2018 adafruit/Adafruit_nRF52_Arduino@f328685. The version in framework-arduinoadafruitnrf52 is: commit cb62bf2188854c5b7c44383571ebb19a414e6137 (tag: v1.6.1)

Since it seemed I needed to update LFS, I went ahead and updated it to the 'final' LFS 1.x release (v1.7.2) so that hopefully this update will be more useful to other adafruit users. I'll be sending them a PR but for the time being I'm changing meshtastic to pull from my fresh fork of their project (see attached commits).

With this fix the 'rename related' corruption no longer occurs.

I updated lfs.c, lfs.h and lfs_util.h by first generating a patch file between the old adafruit tree and v1.6.1 of LFS, then copied in the v1.7.2 version of those three files, then reapplied the patch file so that the (small number of) existing adafruit local changes were preserved.

adafruit specific changes to LFS are capured now in this patch file in ~/tmp
diff -crB ~/development/meshtastic/littlefs littlefs > dfile.patch
kevinh@kdesktop:~/.platformio/packages/framework-arduinoadafruitnrf52/libraries/Adafruit_LittleFS/src$ code dfile.patch 
kevinh@kdesktop:~/.platformio/packages/framework-arduinoadafruitnrf52/libraries/Adafruit_LittleFS/src$ mv dfile.patch ~/tmp/lfs-adafruit.patch
apply patch with "patch -p0 --dry-run < dfile.patch"

Rejected theories

(These are probably not super interesting, but they were in my notes so I'm keeping them for future searchers/me)

Theory: perhaps the problem is that the flash page is still in writing mode upon reboot?!!!
Result: No - flash is not in write mode after reboot

Theory: is it possible other flash writes are in the queue so the count of our semaphore gets fucked up?
Result: NO - because happens even if soft device is disabled with useSoftDevice=false!

Thoery: does flash_cache.c have a bug
Result: NO (I think) due to fairly careful inspection (and temporarily disabling caching entirely)

Theory: should we throttle nodeDB filesystem writes
Result: NO. Currently we only write when about to reboot/shutdown or 1-minute after any DB changes. So in the normal case we'll be writing very rarely anyways
so the remaining risk is lowish.

Theory: icache is disabled if you use debugger to reset CPU while programming?
Result: Probably minor and possibly related to me being under the debugger.
https://infocenter.nordicsemi.com/topic/ps_nrf52840/nvmc.html?cp=5_0_0_3_2#register.ICACHECNF

thebentern added a commit that referenced this issue Aug 13, 2024
* bug #4184: fix config file loss due to filesystem write errors
* Use SafeFile for atomic file writing (with xor checksum readback)
* Write db.proto last because it could be the largest file on the FS (and less critical)
* Don't keep a tmp file around while writing db.proto (because too big to fit two files in the filesystem)
* generate a new critial fault if we encounter errors writing to flash
either CriticalErrorCode_FLASH_CORRUPTION_RECOVERABLE or CriticalErrorCode_FLASH_CORRUPTION_UNRECOVERABLE
(depending on if the second write attempt worked)
* reformat the filesystem if we detect it is corrupted (then rewrite our config files) (only on nrf52 - not sure
yet if we should bother on ESP32)
* If we have to format the FS, make sure to preserve the oem.proto if it exists

* add logLegacy() so old C code in libs can log via our logging

* move filesList() to a better location (used only in developer builds)

* Reformat with "trunk fmt" to match our coding conventions

* for #4395: don't use .exists() to before attempting file open
If a LFS filesystem is corrupted, .exists() can fail when a mere .open()
attempt would have succeeded.  Therefore better to do the .open() in hopes that
we can read the file (in case we need to reformat to fix the FS).
(Seen and confirmed in stress testing)

* for #4395 more fixes, see below for details:
* check for LFS assertion failures during file operations (needs customized lfs_util.h to provide suitable hooks)
* Remove fsCheck() because checking filesystem by writing to it is very high risk, it makes likelyhood that we will
be able to read the config protobufs quite low.
* Update the LFS inside of adafruitnrf52 to 1.7.2 (from their old 1.6.1) to get the following fix:
littlefs-project/littlefs@97d8d5e

* use disable_adafruit_usb.py now that we are (temporarily?) using a forked adafruit lib
We need to reach inside the adafruit project and turn off USE_TINYUSB, just doing that
from platformio.ini is no longer sufficient.

Tested on a wio-sdk-wm1110 board (which is the only board that had this problem)

---------

Co-authored-by: Ben Meadors <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant