
[Bug]: Critical fault #12 (flash filesystem corruption) on Heltec T114 v2 #5839

Open · dahanc opened this issue Jan 13, 2025 · 29 comments
Labels: bug (Something isn't working)

Comments

@dahanc (Contributor)

dahanc commented Jan 13, 2025

Category

Other

Hardware

Heltec Mesh Node T114

Firmware Version

2.5.13.1a06f88

Description

Back in November 2024, my T114 showed a Critical fault #12. I rebooted it and it seemed to work OK, but a few days later it got into a boot loop. The serial debug output was:

INFO  | ??:??:?? 6 Adding node to database with 1 nodes and 147308 bytes free!
DEBUG | ??:??:?? 6 Expanding short PSK #1
INFO  | ??:??:?? 6 Wanted region 1, using US
INFO  | ??:??:?? 6 Saving /prefs/db.proto
lfs warn:314: No more free space 224
ERROR | ??:??:?? 6 Error: can't encode protobuf io error
ERROR | ??:??:?? 6 LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | ??:??:?? 6 LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | ??:??:?? 6 LFS assert: block < lfs->cfg

Then about a week later, another of my T114s also got into a boot loop. This one doesn't have a screen, so I don't know if it also had a Critical fault #12. I erased the flash, installed 2.5.13.1a06f88, connected the USB port to a PC, and logged all of the serial output so that, in case it happened again, I could see what happened right before the first crash and reboot. After about a month, it rebooted:

INFO  | 20:29:31 2159141 [Router] Received nodeinfo from=0xa20a11c8, id=0x7d3b35e6, portnum=4, payloadlen=44
INFO  | 20:29:31 2159141 [Router] Node database full with 80 nodes and 109096 bytes free. Erasing oldest entry
INFO  | 20:29:31 2159141 [Router] Adding node to database with 80 nodes and 109096 bytes free!
DEBUG | 20:29:31 2159141 [Router] old user /, channel=0
DEBUG | 20:29:31 2159141 [Router] Update changed=1 user MKTX G2 Gateway/MKG3, channel=0
DEBUG | 20:29:31 2159141 [Router] State: ON
DEBUG | 20:29:31 2159141 [Router] Node status update: 22 online, 80 total
INFO  | 20:29:31 2159141 [Router] Save /prefs/db.proto
lfs debug:617: Bad block at 126
lfs debug:617: Bad block at 180
INFO  | 20:29:36 2159146 [Router] BLE Disconnected, reason = 0x8
DEBUG | 20:29:36 2159146 [Router] PhoneAPI::close()
lfs debug:640: Relocating 126 160 to 181 160
ERROR | 20:29:37 2159147 [Router] LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | 20:29:37 2159147 [Rou

And after the reboot, this is what it logged before rebooting again:

INFO  | ??:??:?? 3 Adding node to database with 1 nodes and 153800 bytes free!
DEBUG | ??:??:?? 4 Expand short PSK #1
INFO  | ??:??:?? 4 Wanted region 1, using US
DEBUG | ??:??:?? 4 Coerce telemetry to min of 30 minutes on defaults
INFO  | ??:??:?? 4 Save /prefs/db.proto
ERROR | ??:??:?? 4 LFS assert: block < lfs->cfg->block_cou

So it seems that what triggered it was the "Bad block" errors, and maybe the "relocation" code is buggy and corrupts the filesystem? In any case, the "Bad block" errors look like the more relevant clue. From what I see in lfs.c, the "Bad block" message means a flash routine returned LFS_ERR_CORRUPT. However, I didn't see anything in InternalFileSystem.cpp that would return LFS_ERR_CORRUPT, so I think that means lfs_cache_cmp() reported a mismatch (e.g., line 194 of lfs.c).

I haven't looked into the details of how the caching works, but since I don't think the flash is going bad on either of my T114s (V2 hasn't been out that long), I wonder if something else in the firmware is corrupting the memory buffer used for the cache.
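To illustrate the check I'm describing, here's a standalone sketch of the program-then-verify idea (my own illustration, not the actual lfs.c code; the error value and function shapes are placeholders):

#include <stdint.h>
#include <string.h>

// Placeholder for littlefs' real corrupt-data error code (see lfs.h).
#define SKETCH_ERR_CORRUPT (-84)

// Hypothetical flash callbacks, shaped like the lfs_config hooks.
typedef int (*prog_fn_t)(uint32_t block, uint32_t off, const void *buf, uint32_t len);
typedef int (*read_fn_t)(uint32_t block, uint32_t off, void *buf, uint32_t len);

// Program a chunk, then read it back and compare with what we meant to write.
// A mismatch is reported as "corrupt", which is what littlefs logs as
// "Bad block at N" before relocating the block elsewhere.
static int prog_and_verify(prog_fn_t prog, read_fn_t read_back,
                           uint32_t block, uint32_t off,
                           const void *data, uint32_t len)
{
    uint8_t check[128]; // sketch assumes len <= 128
    if (len > sizeof(check))
        return -1;

    int err = prog(block, off, data, len);
    if (err)
        return err;

    err = read_back(block, off, check, len);
    if (err)
        return err;

    return (memcmp(check, data, len) == 0) ? 0 : SKETCH_ERR_CORRUPT;
}

If the data in the cache buffer were corrupted between the write and the compare, the same path would fire even on healthy flash, which is the scenario I'm wondering about.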

Relevant log output

No response

@dahanc dahanc added the bug Something isn't working label Jan 13, 2025
@esev (Contributor)

esev commented Jan 13, 2025

Do you use an Android phone to connect to your Heltec? There are a few other bugs that look similar to this one.

@dahanc (Contributor, Author)

dahanc commented Jan 13, 2025

Yes, I use an Android phone.

@daniel-hofer

Same error on a T-Echo. Logs attached. Using the Android app, 2.5.16 beta.

techolog.txt

ERROR | ??:??:?? 4 Error: can't encode protobuf io error
ERROR | ??:??:?? 4 Can't write prefs!
ERROR | ??:??:?? 4 Failed to save to disk, retrying
DEBUG | ??:??:?? 5 Opening /prefs/config.proto, fullAtomic=1
INFO | ??:??:?? 5 Save /prefs/config.proto
DEBUG | ??:??:?? 5 Opening /prefs/channels.proto, fullAtomic=1
INFO | ??:??:?? 5 Save /prefs/channels.proto
DEBUG | ??:??:?? 6 Opening /prefs/db.proto, fullAtomic=0
INFO | ??:??:?? 6 Save /prefs/db.proto
ERROR | ??:??:?? 6 NOTE! Record critical error 12 at src/mesh/NodeDB.cpp:1220

@liquidraver

T-Echo
2.5.19 with ~10 nodes in the nodeDB
full wipe before install:

DEBUG | 15:53:01 2493 [Router] Opening /prefs/db.proto, fullAtomic=0
INFO  | 15:53:01 2493 [Router] Save /prefs/db.proto
lfs warn:313: No more free space 160
lfs warn:313: No more free space 160
ERROR | 15:53:02 2494 [Router] Error: can't encode protobuf io error
ERROR | 15:53:02 2494 [Router] Can't write prefs!
ERROR | 15:53:02 2494 [Router] Failed to save to disk, retrying
lfs debug:617: Bad block at 1
lfs warn:625: Superblock 1 has become unwritable
lfs debug:617: Bad block at 0
lfs warn:625: Superblock 0 has become unwritable
lfs error:493: Corrupted dir pair at 0 1
DEBUG | 15:53:05 2497 [Router] Opening /prefs/db.proto, fullAtomic=0
INFO  | 15:53:06 2498 [Router] Save /prefs/db.proto
lfs debug:1414: Bad block at 102617196
ERROR | 15:53:06 2498 [Router] LFS assert: block < lfs->cfg->block_count
ERROR | 15:53:06 2498 [Router] Trying FSCom.format()
DEBUG | 15:53:06 2498 [Router] Send module config: audio
DEBUG | 15:53:11 2503 [Router] Send module config: remote hardware
DEBUG | 15:53:20 2512 [Router] Send module config: neighbor info
DEBUG | 15:53:22 2514 [Router] Send module config: ambient lighting
DEBUG | 15:53:24 2516 [Router] Send module config: detection sensor
DEBUG | 15:53:26 2518 [Router] Send module config: paxcounter
DEBUG | 15:53:29 2521 [Router] Send known nodes
INFO  | 15:53:29 2521 [Router] nodeinfo: num=0x75bd7aac, lastseen=0, id=!75bd7aac, name=Meshtastic 7aac
INFO  | 15:53:35 2527 [Router] BLE Disconnected, reason = 0x8
WARNING file:stream_interface.py __reader line:207 Meshtastic serial port disconnected, disconnecting... ClearCommError failed (PermissionError(13, 'The device does not recognize the command.', None, 22))

@Ztrby

Ztrby commented Jan 14, 2025

I think I have the same problem (T114 v2); my device is in a boot loop and I can't get it to start at all.

DEBUG | ??:??:?? 5 Expand short PSK #1
INFO | ??:??:?? 5 Wanted region 3, using EU_868
DEBUG | ??:??:?? 5 Coerce telemetry to min of 30 minutes on defaults
INFO | ??:??:?? 6 Save /prefs/db.proto
ERROR | ??:??:?? 6 LFS assert: block < lfs->cfg->block_count
ERROR | ??:??:?? 6 Trying FSCom.format()

@esev (Contributor)

esev commented Jan 15, 2025

From the initial report:

INFO  | 20:29:31 2159141 [Router] Save /prefs/db.proto
lfs debug:617: Bad block at 126
lfs debug:617: Bad block at 180
INFO  | 20:29:36 2159146 [Router] BLE Disconnected, reason = 0x8
DEBUG | 20:29:36 2159146 [Router] PhoneAPI::close()
lfs debug:640: Relocating 126 160 to 181 160

And from trying to reproduce the issue last night (from a RAK WisBlock 4631):

INFO  | 01:48:53 1293 [Router] Adding node to database with 80 nodes and 117812 bytes free!
DEBUG | 01:48:53 1293 [Router] Update changed=1 user North Fremont - Lakes&Birds Neigh/NFMT, id=0xe0f79ab8, channel=1
DEBUG | 01:48:53 1293 [Router] Node status update: 34 online, 80 total
lfs debug:617: Bad block at 94
lfs debug:617: Bad block at 178
INFO  | 01:48:55 1295 [Router] BLE Disconnected, reason = 0x8
DEBUG | 01:48:55 1295 [Router] PhoneAPI::close()
lfs debug:639: Relocating 94 93 to 179 93
INFO  | 01:48:56 1296 [Router] Save /prefs/db.proto

Both have something Bluetooth-related happening around the same time as an update to the db.proto file.

The Bluetooth library also writes to the file system. https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/master/libraries/Bluefruit52Lib/src/utility/bonding.cpp

Could there be some preemption that causes one LFS change to happen while another LFS change is already in progress?

Edit: Another question to investigate: For each bad block detected, does the usable size of the LFS filesystem decrease by a block? Meaning, after a while of this same thing happening, could it become impossible to write new files due to the file system not having enough free blocks?

Edit2: Seems like this would prevent simultaneous access from happening: adafruit/Adafruit_nRF52_Arduino#397

Edit3: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/Bluefruit52Lib/src/bluefruit.cpp#L711 The flash operations callback inside the bluefruit library?

Edit4: Curious why this semaphore allows up to 10 concurrent accesses instead of just 1? https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L110C12-L110C36 There is some good analysis in adafruit/Adafruit_nRF52_Arduino#350 & littlefs-project/littlefs#352. This seems to be what led to the current locking solution in Adafruit_LittleFS.
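For reference, here's a minimal sketch of the wait-for-completion pattern in question, assuming a FreeRTOS binary semaphore given back from the SoftDevice SoC flash events (illustrative only, not the actual flash_nrf5x.c code; flash_op_complete() and the variable names are made up):

#include <stdbool.h>
#include <stdint.h>
#include "FreeRTOS.h"
#include "semphr.h"
#include "nrf_error.h"
#include "nrf_soc.h"

// One binary semaphore: only one flash erase/write is ever in flight.
static SemaphoreHandle_t _flash_done; // created once with xSemaphoreCreateBinary()
static volatile uint32_t _flash_evt;

// Called from the SoftDevice SoC event dispatcher with the flash result.
// (If this ran in interrupt context, the ...FromISR variant would be needed.)
void flash_op_complete(uint32_t soc_evt)
{
    if (soc_evt == NRF_EVT_FLASH_OPERATION_SUCCESS || soc_evt == NRF_EVT_FLASH_OPERATION_ERROR) {
        _flash_evt = soc_evt;
        xSemaphoreGive(_flash_done); // wake the waiting writer
    }
}

// Blocking page erase: returns true only if the SoftDevice reports success.
bool flash_erase_blocking(uint32_t page_number)
{
    uint32_t err;
    do {
        err = sd_flash_page_erase(page_number); // NRF_ERROR_BUSY: an op is already queued
    } while (err == NRF_ERROR_BUSY);
    if (err != NRF_SUCCESS)
        return false;

    xSemaphoreTake(_flash_done, portMAX_DELAY); // wait for the SUCCESS or ERROR event
    return _flash_evt == NRF_EVT_FLASH_OPERATION_SUCCESS;
}

If only one operation can ever be pending at the SoftDevice, a binary semaphore like this seems sufficient, which is why the counting semaphore with a max of 10 looks surprising.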

@esev (Contributor)

esev commented Jan 15, 2025

I've been able to get some debug logs by modifying this code: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/cores/nRF5/common_func.h#L154-L158

And adding -DCFG_DEBUG=2 -DCFG_LOGGER=0 to build_flags

#if __cplusplus
  extern "C" void logLegacy(const char *level, const char *fmt, ...);
  #define PRINTF(...)    logLegacy("DEBUG", __VA_ARGS__)
#else
  void logLegacy(const char *level, const char *fmt, ...);
  #define PRINTF(...)    logLegacy("DEBUG", __VA_ARGS__)
#endif

I'm not easily able to reproduce the issue though.

@esev (Contributor)

esev commented Jan 15, 2025

There are definitely some file accesses that happen when the phone connects.

[BLE   ] BLE_GAP_EVT_CONNECTED : Conn Handle = 0#
[GAP   ] MAC = 69:26:0F:XX:XX:XX, Type = 2, Resolved = 0#
[GAP   ] Conn Interval = 48.75 ms, Latency = 0, Supervisor Timeout = 5000 ms#
[BLE   ] BLE_GATTC_EVT_CHAR_VAL_BY_UUID_READ_RSP : Conn Handle = 0#
INFO  | 05:28:49 236 BLE Connected to Pixel 8
[BLE   ] BLE_GAP_EVT_CONN_PARAM_UPDATE : Conn Handle = 0#
[GAP   ] Conn Interval= 7.50 ms, Latency = 0, Supervisor Timeout = 5000 ms#
[BLE   ] BLE_GAP_EVT_CONN_PARAM_UPDATE : Conn Handle = 0#
[GAP   ] Conn Interval= 30.00 ms, Latency = 0, Supervisor Timeout = 2000 ms#
[BLE   ] BLE_GAP_EVT_SEC_INFO_REQUEST : Conn Handle = 0#
[PAIR  ] Address: ID = 0, Type = 0x02, 93:0D:4B:XX:XX:XX#
[BOND  ] Loaded keys from file /adafruit/bond_prph/XXXXXXXXXXX#
[BLE   ] BLE_GAP_EVT_CONN_SEC_UPDATE : Conn Handle = 0#
[BOND  ] Loaded CCCD from file /adafruit/bond_prph/XXXXXXXXX ( offset = 90, len = 32 bytes )#
[PAIR  ] Security Mode = 1, Level = 3#
INFO  | 05:28:50 237 BLE connection secured
[BLE   ] BLE_GATTS_EVT_EXCHANGE_MTU_REQUEST : Conn Handle = 0#
[GAP   ] ATT MTU is changed to 247#
[BLE   ] BLE_GATTS_EVT_WRITE : Conn Handle = 0#
[GATTS ] attr's cccd = 0x0001#
[BOND  ] CCCD matches file /adafruit/bond_prph/XXXXXXXXXX contents, no need to write#
[BLE   ] BLE_GAP_EVT_DATA_LENGTH_UPDATE_REQUEST : Conn Handle = 0#
[GAP   ] Data Length Request is (tx, rx) octets = (251, 251), (tx, rx) time = (17040, 17040) us#
[BLE   ] BLE_GAP_EVT_DATA_LENGTH_UPDATE : Conn Handle = 0#
[GAP   ] Data Length is (tx, rx) octets = (251, 251), (tx, rx) time = (2120, 2120) us#
[BLE   ] BLE_GATTS_EVT_RW_AUTHORIZE_REQUEST : Conn Handle = 0#
[BLE   ] BLE_GATTS_EVT_WRITE : Conn Handle = 0#
[GATTS ] attr's value, uuid = 0x76D2, len = 2#

@todd-herbert (Contributor)

todd-herbert commented Jan 15, 2025

@esev It seems like you've got a lot more knowledge than me here, and it seems like there are some good clues in the thread so far. I've been testing around this issue today, and as far as I can tell, it occurs when Bluetooth disconnects unexpectedly during a flash write.

The reason for the disconnection seems relevant.

A graceful disconnect, initiated by the phone, gives reason:

0x13 BLE_HCI_REMOTE_USER_TERMINATED_CONNECTION

If this type of disconnection occurs during a flash write, there is no issue.

Disconnection types that I've seen cause issues are:

0x08 BLE_HCI_CONNECTION_TIMEOUT
0x22 BLE_HCI_STATUS_CODE_LMP_RESPONSE_TIMEOUT

I see 0x08 when the device moves out of range.
I see 0x22 when the device is disconnected on NRF52Bluetooth::shutdown.
If either of these disconnections occur during a flash write, lfs reports bad blocks, and corrupts the flash data.

One workaround for the 0x22 case seems to be to modify NRF52Bluetooth::shutdown to poll for disconnection before continuing.
This doesn't get to the root of the issue, but does seem to reliably solve this one case.

void NRF52Bluetooth::shutdown()
{
    // Shutdown bluetooth for minimum power draw
    LOG_INFO("Disable NRF52 bluetooth");
    uint8_t connection_num = Bluefruit.connected();
    if (connection_num) {
        for (uint8_t i = 0; i < connection_num; i++) {
            LOG_INFO("NRF52 bluetooth disconnecting handle %d", i);
            Bluefruit.disconnect(i);
        }

        // Wait for disconnection
        while(Bluefruit.connected())
            yield();
        LOG_INFO("All bluetooth connections ended");
    }
    Bluefruit.Advertising.stop();
}
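A possible refinement (my own untested suggestion, not part of the workaround above): bound the wait so shutdown cannot hang if the disconnect event never arrives.

// Bounded variant of the wait loop above; the 2-second cap is an arbitrary choice.
uint32_t start = millis();
while (Bluefruit.connected() && (millis() - start) < 2000) {
    yield();
}
LOG_INFO("All bluetooth connections ended (or wait timed out)");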

I haven't found any workaround for 0x08 disconnections so far. It would be nice if there's a true solution to be found somewhere deeper down in Bluefruit / the SoftDevice, but that's probably beyond my skill level.


I'm not easily able to reproduce the issue though.

It may depend on the phone used for testing, but here is how I have been able to reproduce the issue:

  • Heltec T114 (V1)
  • Android app 2.5.16
  • No existing config required (erased before flash)

0x22

0x08

  • Modify the firmware from #5858 (Fix NRF52 default append write mode of files): add nodeDB->saveToDisk() to ButtonThread.cpp, so that a user button press triggers a flash write
  • Pair phone via Bluetooth
  • Place phone in unplugged microwave to shield RF (sophisticated... I know)
  • Node powered by laptop, for portable logging.
  • Bring the node very near the closed microwave, allowing Bluetooth to automatically reconnect.
  • Move away from the microwave. A short distance (~3 m seems to be enough in my case).
  • Before the Bluetooth connection is lost, press the user button, triggering the flash write. The connection loss seems to occur within several seconds, so it is fairly easy to time this without repeatedly pressing the button.

This is all a bit unorthodox, but using this method I seem to be able to repeatedly reproduce the logs from #5839 (comment)

@thebentern (Contributor)

There are definitely some file accesses that happen when the phone connects.

Looks like we are likely going to need to make some modifications to Bluefruit's persistence. I would prefer it to pull more of those things into memory to cut down on file system IO if possible. That's a shoot-from-the-hip answer though, until I dig in more.

One workaround for the 0x22 case seems to be to modify NRF52Bluetooth::shutdown to poll for disconnection before continuing.
This doesn't get to the root of the issue, but does seem to reliably solve this one case.

@todd-herbert I think that workaround is worth a PR even if it's not a comprehensive fix.

@todd-herbert (Contributor)

With the changes described in #5839 (comment), I've grabbed some more detailed logs of what's going on in the BLE_HCI_CONNECTION_TIMEOUT (0x08) scenario (disconnect out of range). Not sure if there's anything to be made of it, but here's the info:

DEBUG | 12:14:45 87 [Power] Battery: usbPower=0, isCharging=0, batMv=4159, batPct=97
DEBUG | 12:14:47 89 [Button] Opening /prefs/config.proto, fullAtomic=1
INFO  | 12:14:47 89 [Button] Save /prefs/config.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:617: Bad block at 30
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:617: Bad block at 29
[BLE   ] BLE_GAP_EVT_DISCONNECTED : Conn Handle = 0#
[GAP   ] Disconnect Reason: CONNECTION_TIMEOUT #
INFO  | 12:14:49 91 [Button] BLE Disconnected, reason = 0x8
DEBUG | 12:14:49 91 [Button] PhoneAPI::close()
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:639: Relocating 30 5 to 35 5
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:2526: Found desync 35 5
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:49 92 [Button] Opening /prefs/module.proto, fullAtomic=1
INFO  | 12:14:49 92 [Button] Save /prefs/module.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:50 92 [Button] Opening /prefs/channels.proto, fullAtomic=1
INFO  | 12:14:50 92 [Button] Save /prefs/channels.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:50 93 [Button] Opening /prefs/db.proto, fullAtomic=0
INFO  | 12:14:50 93 [Button] Save /prefs/db.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:51 93 [Button] State: ON
bool BLEAdvertising::_start(uint16_t, uint16_t): 377: verify failed, error = NRF_ERROR_INVALID_STATE#
DEBUG | 12:15:05 107 [Power] Battery: usbPower=0, isCharging=0, batMv=4148, batPct=97

@esev (Contributor)

esev commented Jan 15, 2025

That's very helpful, @todd-herbert! Knowing it happens when the phone goes out of range narrows things down. Nothing in the Bluefruit library accesses LFS at that point, so we can potentially rule out "multiple fs access" as the cause.

This may be odd timeout behavior in the hardware. I'll look into this more after work tonight.

The microwave idea and the save on button press should make debugging much easier.

@todd-herbert (Contributor)

todd-herbert commented Jan 15, 2025

Looking at the (very helpful) reference links you've collected, I think I might have a working solution, although possibly I'm just being naive... I'm not sure what the full implications of it would be. It's late here now, but I'll try to get a proof of concept pushed tonight so you can see the general idea and check if there's any merit in it.

@todd-herbert (Contributor)

todd-herbert commented Jan 15, 2025

Here are the changes I threw together that appear to fix the issue: flash_nrf5x.c

DEBUG | 16:17:19 43 [Button] Opening /prefs/config.proto, fullAtomic=1
INFO  | 16:17:19 43 [Button] Save /prefs/config.proto
DEBUG | 16:17:20 44 [Button] Opening /prefs/module.proto, fullAtomic=1
INFO  | 16:17:20 44 [Button] Save /prefs/module.proto
DEBUG | 16:17:20 44 [Button] Opening /prefs/channels.proto, fullAtomic=1
INFO  | 16:17:20 44 [Button] Save /prefs/channels.proto
INFO  | 16:17:25 49 [Button] BLE Disconnected, reason = 0x8
DEBUG | 16:17:25 49 [Button] PhoneAPI::close()
DEBUG | 16:17:25 49 [Button] Opening /prefs/db.proto, fullAtomic=0
INFO  | 16:17:25 49 [Button] Save /prefs/db.proto
DEBUG | 16:17:25 49 [Button] State: ON
DEBUG | 16:17:25 49 [Power] Battery: usbPower=0, isCharging=0, batMv=4172, batPct=98
INFO  | 16:17:25 49 [DeviceTelemetry] Send: air_util_tx=0.002611, channel_utilization=0.156667, battery_level=98, voltage=4.172000, uptime=49

I seem to now be able to trigger the problematic disconnect with no consequences. Not shown here, but with the additional logging setup you described earlier I was able to add extra debug output to confirm that, in the situation shown above, the flash operations do fail, and are successfully reattempted.
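For anyone following along, here's a rough sketch of the retry idea (my own illustration under assumptions, not the actual patch in flash_nrf5x.c; flash_op_wait_for_event() and MAX_FLASH_RETRIES are made-up names, and the real change lives in the linked file):

#include <stdbool.h>
#include <stdint.h>
#include "nrf_error.h"
#include "nrf_soc.h"

#define MAX_FLASH_RETRIES 3 // illustrative limit

// Hypothetical helper: blocks until the SoftDevice reports the outcome of the
// last flash operation, returning NRF_EVT_FLASH_OPERATION_SUCCESS or _ERROR.
extern uint32_t flash_op_wait_for_event(void);

// Write 'words' 32-bit words, reissuing the operation if the SoftDevice reports
// a failure (e.g. the write collided with radio activity around a BLE timeout).
static bool flash_write_with_retry(uint32_t *dst, const uint32_t *src, uint32_t words)
{
    for (int attempt = 0; attempt < MAX_FLASH_RETRIES; attempt++) {
        uint32_t err;
        do {
            err = sd_flash_write(dst, src, words); // NRF_ERROR_BUSY: an op is already queued
        } while (err == NRF_ERROR_BUSY);

        if (err != NRF_SUCCESS)
            return false; // bad parameters etc.; retrying will not help

        if (flash_op_wait_for_event() == NRF_EVT_FLASH_OPERATION_SUCCESS)
            return true; // done; LittleFS never sees the transient failure

        // NRF_EVT_FLASH_OPERATION_ERROR: instead of asserting, just try again.
    }
    return false; // persistent failure: let LittleFS mark the block bad and relocate
}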

Edit4: Curious why this semaphore allows up to 10 concurrent accesses instead of just 1?

That is the big question... I've replaced it with a binary semaphore with no immediately apparent consequences, but it's a bit over my head, so I'm hoping someone more knowledgeable than me will know if there was some very good reason for that counting semaphore.

@esev (Contributor)

esev commented Jan 15, 2025

Yes! That is the exact same thing (retries) I was going to try. Nice work! I'm so happy to see the issue go away with that change! :)

I'll dig in more on the semaphore. I'm kind of wondering if everything behaves the same if the max value is decreased to 1 (oh! you effectively did this too). Maybe only one is actually permitted at a time and the changes happen in series by waiting for err != NRF_ERROR_BUSY?

I think your change addresses the root issue. I'll give it a try tonight.

@esev (Contributor)

esev commented Jan 16, 2025

Finally got a chance to test. This is working perfectly on my device. I can trigger the issue and I see the retry happen and succeed. No more lfs messages!

Prior to testing your change, I also added some logging to show how many simultaneous callers passed through the original semaphore. It was only ever one at a time. That matches what I was expecting too, as I think the device can only perform one flash operation at a time and reports NRF_ERROR_BUSY when one is in progress. I like your change to use a binary semaphore. It seems much less surprising to me.

@esev (Contributor)

esev commented Jan 16, 2025

One thing. I'm not sure it is necessary to assert inside the flash operations. In theory LittleFS should detect the failure and relocate the failing blocks. It might be okay to give it a chance to do that.

https://github.com/littlefs-project/littlefs/blob/v1.7.2/DESIGN.md

@dahanc (Contributor, Author)

dahanc commented Jan 16, 2025

Very cool! I don't think I would have noticed that it was related to Bluetooth :)

@esev (Contributor)

esev commented Jan 16, 2025

For keeping track of debugging ideas for the future:

  • I ended up adding a switch (c) statement inside this if statement:

        if (c != START1)
            rxPtr = 0; // failed to find framing

    I used the 'c' character over serial to trigger nodeDB->saveToDisk(). That could easily be extended to trigger other events when debugging as well (roughly as sketched below).
  • A faraday bag (https://www.amazon.com/gp/product/B0BYM9VJMR) works well to trigger the Bluetooth timeout.
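A rough sketch of what that hook looks like in context (this is a fragment meant for the existing serial-framing handler, not a standalone file; START1, rxPtr, and nodeDB->saveToDisk() are the identifiers mentioned above, and the switch is the debug-only addition):

if (c != START1) {
    switch (c) {
    case 'c':
        nodeDB->saveToDisk(); // typing 'c' over serial forces a flash write
        break;
    // further single-character debug triggers could go here
    default:
        break;
    }
    rxPtr = 0; // failed to find framing
}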

@todd-herbert (Contributor)

Yes! That is the exact same thing (retries) I was going to try

That is reassuring to know that you had also independently reached the same conclusion about where we should go with this for a fix!

Maybe only one is actually permitted at a time and the changes happen in series by waiting for err != NRF_ERROR_BUSY?

Prior to testing your change, I also added some logging to show how many simultaneous callers passed through the original semaphore. It was only ever one at a time. That matches what I was expecting too, as I think the device can only perform one flash operation at a time and reports NRF_ERROR_BUSY when one is in progress.

Ahh that would make sense if NRF_ERROR_BUSY blocks it that way. For what it's worth, I didn't see any instances of NRF_ERROR_BUSY in the wild while testing the proto saving, but I imagine it's totally possible that the Meshtastic firmware just doesn't trigger that sort of a simultaneous write situation.

One thing. I'm not sure it is necessary to assert inside the flash operations. In theory LittleFS should detect the failure and relocate the failing blocks. It might be okay to give it a chance to do that.

That's a very good point. I'd restructured a bit today to cut back on the amount of duplicate code, and in the process had gotten rid of the assert, out of a general fear of unintentionally disrupting some fs behavior. Hopefully there's nothing similar lurking in today's changes too. I've opened it up as draft at meshtastic/Adafruit_nRF52_Arduino#1, and would very much appreciate your feedback if you spot anything we could do better.

@tavdog (Contributor)

tavdog commented Jan 17, 2025

I just got another LFS_ASSERT crash. Not boot looping this time, though. I tried using https://github.com/todd-herbert/meshtastic-nrf52-arduino/tree/reattempt-flash-ops to see if it would bring it back up, but it did not fix it. The only way I have been able to resuscitate it is to add FSCom.format() into the Arduino setup function. Then, once it has been reformatted on first boot, re-flash with the format line removed.

@todd-herbert (Contributor)

When I was inducing that 0x22 BLE disconnect failure on shutdown, I would see it eventually end up in that same situation, after I had induced several shutdown-corruptions and the blocks had remapped several times. Maybe there's a clue there?

Did you give the device a full erase recently, or has it had a history of these corruption events already?

If it does lock up again, one idea might be to enable the extra logging as described by #5839 (comment), to see if there's any interesting info. I imagine you could do this even after it locks up.

@tavdog (Contributor)

tavdog commented Jan 17, 2025

When I was inducing that 0x22 BLE disconnect failure on shutdown, I would see it eventually end up in that same situation, after I had induced several shutdown-corruptions and the blocks had remapped several times. Maybe there's a clue there?

Did you give the device a full erase recently, or has it had a history of these corruption events already?

If it does lock up again, one idea might be to enable the extra logging as described by #5839 (comment), to see if there's any interesting info. I imagine you could do this even after it locks up.

Yes, this device has had a history of this happening. This is the third time I've had to deal with it. The first time it was reboot looping because I think the nodeDB got too big, but I also left it in the house and might have had the BT distance thing happen. My attempts at a full erase didn't fix the issue, but I did try the full erase during the first and second occurrences.

@NireBryce

NireBryce commented Jan 18, 2025

Different device, but adding more diagnostic data: this just happened to my Nano G2 a second time (previously a month ago). Clearing the nodeDB fixed it, on a hunch, after a few days of "nodeDB full" warnings on the very populated Boston mesh.

Android Meshtastic app.

Critical fault 12, but this time I caught it before the boot loop.

An early symptom was the node forgetting the name I set, about two days before the error display and the aggressive reconnect loop.

After fault 12 is displayed and the reconnection loop starts, when I connect over USB to clear the nodeDB, the region is also UNSET. Clearing the nodeDB un-wedges it.

It almost feels like swap starvation or something, or a broken flash-controller load-balancing algorithm, like the counterfeit SD cards that report 256 GB but really just overwrite preexisting blocks once they go past 2 GB.

Interestingly, the random PIN was also the same PIN over and over for every reconnection attempt, until I reset the nodeDB.

@esev (Contributor)

esev commented Jan 18, 2025

Edit: This is described much better in #4447

I've been digging a little deeper into the InternalFileSystem implementation. I'm not sure it is set up to work best with LittleFS. Normally LittleFS should be able to detect a failed flash erase/write and recover from that, but I think the InternalFileSystem implementation breaks that assumption.

Reason: the LittleFS documentation specifies that block_size should match the size of an erasable block. That is so LittleFS can read the block contents and compare them with its cache after writing, to ensure they were written correctly. InternalFileSystem configures the LittleFS block_size to 128 bytes, while the smallest erasable flash unit is actually 4096 bytes. So when/if a flash erase/write fails, 4096 bytes of the filesystem could be lost, but LittleFS is only able to detect and recover 128 bytes of that block.

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/4dcfa3b7960a6dafec6736a5dbea255b0946be83/libraries/InternalFileSytem/src/InternalFileSystem.cpp#L35

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/4dcfa3b7960a6dafec6736a5dbea255b0946be83/libraries/InternalFileSytem/src/InternalFileSystem.cpp#L109-L113

https://github.com/littlefs-project/littlefs/blob/7e110b44c0e796dc56e2fe86587762d685653029/lfs.h#L146-L150

Am I reading this correctly?

I think retrying the flash writes (meshtastic/Adafruit_nRF52_Arduino#1) is going to help a lot here. But if there is an actual bad flash page, I think LittleFS may never be able to relocate around the issue due to the way it is configured in InternalFileSystem. It may also make the wear-leveling algorithm in LittleFS less effective as LittleFS is unaware of the true physical blocks that are being erased.

Note: I'm not sure there is enough storage on the NRF52 devices to make block_size match the actual flash block size. So this may explain why it is configured this way. But I think this configuration breaks some of the assumptions that LittleFS depends upon for detecting and recovering from bad flash pages.
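To put numbers on it, a quick sketch using the figures above (128-byte LittleFS blocks vs. 4096-byte flash pages):

#include <stdio.h>

int main(void)
{
    const unsigned lfs_block_size = 128;     // block_size InternalFileSystem passes to LittleFS
    const unsigned nrf52_flash_page = 4096;  // smallest erasable unit of the nRF52 internal flash

    // A failed or interrupted page erase can affect this many LittleFS blocks,
    // while LittleFS can only verify and relocate the single block it was writing.
    printf("LittleFS blocks per physical page: %u\n", nrf52_flash_page / lfs_block_size);          // 32
    printf("Neighboring bytes at risk per failed erase: %u\n", nrf52_flash_page - lfs_block_size); // 3968
    return 0;
}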

@todd-herbert (Contributor)

I do remember seeing discussion about that over at #4447, so I think you're probably right on the money again.

@esev (Contributor)

esev commented Jan 18, 2025

Ha! Thank you for pointing me to that bug. Yes, that describes what I was seeing exactly. Including the likely reason it was implemented this way :)

@NireBryce

That would track with at least my issue.

What made mine work was:

  • first killing the nodeDB
  • changing options that cause a reboot (and presumably config write)
  • and pairing a Bluetooth device (presumably creating a new paired devices list)

If it's clobbering 3968 bytes of neighboring files for every erased 128-byte block, it makes sense that rewriting the files would fix it.

(Is there any chance this issue leaves garbage file fragments on the filesystem that aren't being erased? I'm not familiar with LittleFS.)

@esev (Contributor)

esev commented Jan 19, 2025

If it's clobbering 3968 bytes of neighboring files for every erased 128-byte block, it makes sense that rewriting the files would fix it.

Just to clarify: I don't believe this is happening for every write. There are two scenarios where it seems likely, though:

  1. If a Bluetooth device happens to move out of range during a write, the write seems to fail.
  2. If the device loses power during a write.

Is there any chance this issue leaves garbage file fragments on the filesystem that aren't being erased? I'm not familiar with LittleFS

I had originally thought LittleFS might be losing blocks, slowly decreasing its capacity over time when this happens. However, after digging a little deeper into LittleFS, I noticed it doesn't keep a free or corrupt block list. I think that leads to two outcomes:

  1. If LittleFS thinks garbage file fragments are part of an existing file/directory then it is likely LittleFS will hit an assertion. I commonly saw head >= 2 && head <= lfs->cfg->block_count being associated with this. There could be other assertions as well.
  2. If LittleFS doesn't have any file/directory referencing the garbage file fragments, then it may reuse & overwrite those fragments when it needs to allocate new space.
