Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stm32: QSPI flash driver concurrent access issue #38548

Closed
hjuul opened this issue Sep 15, 2021 · 27 comments
Closed

stm32: QSPI flash driver concurrent access issue #38548

hjuul opened this issue Sep 15, 2021 · 27 comments
Assignees
Labels
area: QSPI Quad SPI bug The issue is a bug, or the PR is fixing a bug platform: STM32 ST Micro STM32 priority: low Low impact/importance bug

Comments

@hjuul
Copy link
Contributor

hjuul commented Sep 15, 2021

Describe the bug
Reading or writing from/to QSPI flash from different contexts doesn't work.
flash_stm32_qspi.c seems to not be thread safe by design.

Impact
File system API doesn't handle concurrency either, so it is left to the application.
But in my application there are several subsystems that also access the QSPI driver (logging with fs backend, settings API, fcb) so handling concurrency in the application is nontrivial.

Logs and console output
This is what happens when I try to read log files through the file system API while the logging thread is running:

[00:00:31.720,004] <err> flash_stm32_qspi: 1: Failed to send QSPI instruction
[00:00:31.720,024] <err> fs: failed get file or dir stat (-5)
[00:00:32.020,145] <err> flash_stm32_qspi: 2: Failed to send QSPI instruction
[00:00:32.020,152] <err> flash_stm32_qspi: 2: Failed to send QSPI instruction
[00:00:32.020,161] <err> flash_stm32_qspi: 2: Failed to send QSPI instruction
[00:00:32.020,176] <err> flash_stm32_qspi: 2: Failed to send QSPI instruction
[00:00:32.020,189] <err> fs: file write error (-5)
[00:00:32.020,220] <err> flash_stm32_qspi: 2: Failed to send QSPI instruction
[00:00:32.020,234] <err> fs: directory open error (-5)

Environment (please complete the following information):

  • Zephyr v2.7.0-rc1
@hjuul hjuul added the bug The issue is a bug, or the PR is fixing a bug label Sep 15, 2021
@FRASTM FRASTM added area: QSPI Quad SPI platform: STM32 ST Micro STM32 labels Sep 15, 2021
@FRASTM
Copy link
Collaborator

FRASTM commented Sep 15, 2021

@hjuul could you please detail if there is specific target board, shield and appli that your are using to show this error ?

@hjuul
Copy link
Contributor Author

hjuul commented Sep 15, 2021

@FRASTM I'm using a custom board based on stm32f777 with QSPI flash MT25QL256ABA.
I'm also using a custom application.

Some more details on Discord: https://discord.com/channels/720317445772017664/887588796839903242/887588800635748352

@erwango erwango self-assigned this Sep 15, 2021
@erwango erwango added the priority: low Low impact/importance bug label Sep 15, 2021
@erwango
Copy link
Member

erwango commented Sep 15, 2021

@hjuul Having a new look to this.

So maybe I was missing coffee when I initially answered to you but
flash_stm32_qspi.c is actually thread safe (cf qspi_lock_thread/qspi_unlock_thread).

The reason why you're seeing issue is hence not as obvious as initially thought.
We'd need more info on the context of the first error reported.

Would you mind enabling DBG logs and see what shows up ?

@erwango erwango changed the title stm32: QSPI flash driver is not thread safe stm32: QSPI flash driver concurrent access issue Sep 15, 2021
@hjuul
Copy link
Contributor Author

hjuul commented Sep 17, 2021

I spent some time debugging this, but haven't found the root cause yet.

Enabling DBG logs wasn't feasible because it increases the number of qspi writes, which in turn increases the number of logs, and so on.

A few observations:

  • I've narrowed it down to the QSPI_WaitFlagStateUntilTimeout() call in HAL_QSPI_Command_IT. When the failure happens the system freezes for 5s (which is the value of hqspi->Timeout), then QSPI_WaitFlagStateUntilTimeout() returns HAL_ERROR.
  • It takes time to reproduce the first failure after system power-up. But when a failure has occured, all future read accesses to the QSPI flash fails. It seems the QUADSPI->SR->BUSY bit is stuck at 1 in this case
  • I have indications (but not 100%) that the first failure only happens when the reading thread interrupts the logging thread
  • I noticed that the QSPI driver only uses a single data line (doesn't use FAST READ). At a QSPI frequency of 54MHz I reckon that a single QSPI access could easily take up to 10us, enough time for me to do another access. Could it be an issue with the dev_data->sync semaphore?

Side note: In stm32f746g_disco.dts the quadspi max frequency is set to 72MHz. If I understand the corresponding qspi flash data sheet correctly, the correct value should be 54MHz (because the stm32 driver only uses the READ (03h) command, not FAST READ)

@FRASTM
Copy link
Collaborator

FRASTM commented Sep 17, 2021

QSPI_WaitFlagStateUntilTimeout

In HAL_QSPI_Command_IT, the BUSY flag is checked until sw timeout, but do you see the (HW) Timeout flag (TOF) set in the status register QUADSPI_SR

@hjuul
Copy link
Contributor Author

hjuul commented Sep 20, 2021

@FRASTM , When the failure has occurred, SR->BUSY is stuck at one but SR->TOF is not set. (FTF and TCF bits are set and FLEVEL=28. TEF=0)

@FRASTM
Copy link
Collaborator

FRASTM commented Sep 20, 2021

For the moment I do not see this error on my sample application. I am Sharing few ideas based on the AN4760
Looking into the code I find that the default timeout value is 5000U (from HAL_QSPI_TIMEOUT_DEFAULT_VALUE) (this 5s is enough if tick is of 1ms, but ...)
The flash_stm32_qspi_erase() function does not return the return code of the qspi_wait_until_ready(dev);
The flash_stm32_qspi_read() function does not wait for qspi_wait_until_ready(dev) after qspi_read_access()
The return code of the HAL_QSPI_Command_IT is not identified when HAL_ERROR is returned.
So the dev_data->hqspi->ErrorCode == HAL_QSPI_ERROR_TIMEOUT is not handled. In the case a Timeout occurs during the HAL_QSPI_Command_IT, should an ABORT command be sent ?

@FRASTM
Copy link
Collaborator

FRASTM commented Sep 20, 2021

With log, the failure could appear more clearly in LOG_ERR("%d: Failed to send QSPI instruction", hal_ret); if you suspect an error during the qspi_read_access()
In this case trying to abort could help unlock the transfer:

@@ -147,6 +147,10 @@ static int qspi_read_access(const struct device *dev, QSPI_CommandTypeDef *cmd,
 	hal_ret = HAL_QSPI_Command_IT(&dev_data->hqspi, cmd);
 	if (hal_ret != HAL_OK) {
 		LOG_ERR("%d: Failed to send QSPI instruction", hal_ret);
+		hal_ret = HAL_QSPI_Abort_IT(&dev_data->hqspi);
+		if (hal_ret != HAL_OK) {
+			LOG_ERR("%d: Failed to abort QSPI instruction", hal_ret);
+		}
 		return -EIO;
 	}

@hjuul
Copy link
Contributor Author

hjuul commented Sep 20, 2021

@FRASTM , from the log output of my original post you can see that hal_ret is 1 on first failure, then 2 on all subsequent accesses.

I will try with the additional abort tomorrow. But even if that fixes the symptom, isn't it a problem that I get an error in the first place?

@hjuul
Copy link
Contributor Author

hjuul commented Sep 21, 2021

@FRASTM I tried your patch, and also applied it in qspi_write_access() . HAL_QSPI_Abort_IT() always returns HAL_OK and it prevents my system from freezing for 5s. But it doesn't really solve my problem because I still can't access the QSPI flash, and the failure still persists after the first failure.

In the meantime I've updated to Zephyr v2.7.0-rc2, and there I haven't been able to reproduce the failure. If I go back to v2.7.0-rc1 it starts failing again. So either something has been fixed between rc1 and rc2, or there is some code relocation or similar that fixes a timing issue. Update: It eventually failed on rc2 also.

@FRASTM
Copy link
Collaborator

FRASTM commented Sep 21, 2021

In the flash_stm32_qspi.c, two callback functions are not defined
HAL_QSPI_AbortCpltCallback
HAL_QSPI_FifoThresholdCallback
So in case one of those irq is raising, the dev_data->sync should also be released :

@@ -540,6 +540,28 @@ void HAL_QSPI_TimeOutCallback(QSPI_HandleTypeDef *hqspi)
 	k_sem_give(&dev_data->sync);
 }
 
+/*
+ * Status FifoThreshold callback.
+ */
+void HAL_QSPI_FifoThresholdCallback(QSPI_HandleTypeDef *hqspi)
+{
+	struct flash_stm32_qspi_data *dev_data =
+		CONTAINER_OF(hqspi, struct flash_stm32_qspi_data, hqspi);
+
+	k_sem_give(&dev_data->sync);
+}
+
+/*
+ * Status Abort callback.
+ */
+void HAL_QSPI_AbortCpltCallback(QSPI_HandleTypeDef *hqspi)
+{
+	struct flash_stm32_qspi_data *dev_data =
+		CONTAINER_OF(hqspi, struct flash_stm32_qspi_data, hqspi);
+
+	k_sem_give(&dev_data->sync);
+}
+

@hjuul
Copy link
Contributor Author

hjuul commented Sep 22, 2021

@FRASTM, I applied your latest patch but I'm still able to reproduce my issue.

From my trace it seems HAL_QSPI_FifoThresholdCallback() is called lots of times though. (I have QUADSPI_CR.FTHRES=7)

Btw your patch has a side effect. At boot I now get the following error:
[00:00:00.000,439] <err> flash_stm32_qspi: Unexpected flash size: 16845047
(my qspi flash is 32M)

@FRASTM
Copy link
Collaborator

FRASTM commented Sep 28, 2021

@hjuul would it be possible to have the part of your application or a simple change to apply on the existing samples/drivers/spi_flash that could reproduce the pb you see.

@hjuul
Copy link
Contributor Author

hjuul commented Oct 4, 2021

@FRASTM, I'm afraid I don't have time to work with this at the moment, but I'll continue to look into this later.

@FRASTM
Copy link
Collaborator

FRASTM commented Oct 20, 2021

PR #39570 to add qspi callback functions for stm32 driver

@github-actions
Copy link

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@hjuul
Copy link
Contributor Author

hjuul commented Jan 10, 2022

Hi, I'm reopening this issue.

In the original post I was seeing this issue when using LittleFS logging backend with the file system on QSPI flash. Since I couldn't make it work I disabled that feature.

Now I'm using NVS with a storage partition in QSPI flash, and I'm seeing the exact same issue. I'm not using the QSPI flash for anything else, so I no longer believe this is an issue of concurrent access to the QSPI flash.

As before, the failure happens rarely, but when it does occur I can not use the QSPI flash again until I do a system reset.

I believe there are several issues here:

  1. Every now and then HAL_QSPI_Command_IT() gets stuck waiting for the busy flag. I don't know why. Could be hardware issue, but I doubt it
  2. QSPI_WaitFlagStateUntilTimeout() has a timeout of 5s, which is waaay to much, espicially for read operations. So when 1 happens, my system stalls for 5s and I can't change it
  3. Once 1 has happened, hal_qspi sets hqspi->State = HAL_QSPI_STATE_ERROR. This isn't handled in any way in the Zephyr driver, so all subsequent accesses to the QSPI flash will return HAL_BUSY.

Issues 2) and 3) could probably be solved quite easily.
For issue 1) I have no idea what's going on, but here are some details:

I've attached a trace view of the last NVS read before the failure occurs. The relevant functions context are shown in red bars, the QSPI ISR is the yellow bar. This looks as expected (to me at least), and notice that QUADSPI->SR==0x00 at the end of qspi_read_access().
The next time I do an NVS read, more than a second later, QUADSPI->SR==0x2026 already when entering qspi_read_access(), indicating that the QUADSPI FIFO is full and BUSY, FTF and TCF flags are set. I don't understand how those could be set - there's been no invocations of flash_stm32_qspi_isr() in-between. And due to this, sending the READ command times out after 5s.

I'd really appreciate any suggestions on this; both on why this failure happens in the first place, and how to recover gracefully from such a failure. I'll gladly provide more data if needed, and I'm working to try to reproduce this on a disco board.

nvs_read

@hjuul
Copy link
Contributor Author

hjuul commented Jan 10, 2022

@FRASTM, could you please reopen this issue?

@erwango erwango reopened this Jan 10, 2022
@github-actions github-actions bot removed the Stale label Jan 11, 2022
@FRASTM FRASTM self-assigned this Jan 11, 2022
@hjuul
Copy link
Contributor Author

hjuul commented Jan 18, 2022

I've been looking into this for the past week now. I still don't know what is causing it, but I've got some more information.

HAL_QSPI_IRQHandler() disables the QSPI interrupt(s) when done. I tried re-enabling them to see what happened:

static void flash_stm32_qspi_isr(const struct device *dev)
{
	struct flash_stm32_qspi_data *dev_data = DEV_DATA(dev);

	HAL_QSPI_IRQHandler(&dev_data->hqspi);
	__HAL_QSPI_ENABLE_IT(&dev_data->hqspi, QSPI_IT_TE | QSPI_IT_TC); // Enable interrupts
}

With this patch, sometimes I see that a QSPI interrupt is triggered 212us after flash_stm32_qspi_read() is done! Upon entering the ISR, QUADSPI->SR=0x2026. I have no idea what triggers this, but every time this happens, my next attempt to read the QSPI flash will fail.

I am able to work around this by checking the BUSY bit at the start of qspi_read_access(), and then doing an abort procedure if it is set - but it is not very satisfactory:

	if (READ_BIT(QUADSPI->SR, QUADSPI_SR_BUSY) != 0) {
		LOG_ERR("QUADSPI->SR=0x%X, aborting", QUADSPI->SR);
		SET_BIT(QUADSPI->CR, QUADSPI_CR_ABORT);
		while (READ_BIT(QUADSPI->CR, QUADSPI_CR_ABORT) != 0) {
			LOG_ERR("wait");
			k_sleep(K_MSEC(1));
		}
	}

If you have any idea what could be causing this extra interrupt after the read procedure is done, or have suggestions on how I can debug this further, please let me know.

@FRASTM
Copy link
Collaborator

FRASTM commented Jan 19, 2022

Depending on which interrupts are set, among
• Timeout
• Status match
• FIFO threshold
• Transfer complete
• Transfer error
You should see which one is coming after the read procedure

Did you had a look at the the Errata sheet ?

@hjuul
Copy link
Contributor Author

hjuul commented Jan 19, 2022

QUADSPI->SR=0x2026, so that means BUSY, FTF and TCF flags are set.

I've gone through the erratas but nothing there seems to apply to my case.

@hjuul
Copy link
Contributor Author

hjuul commented Mar 16, 2022

I'm closing this issue.

It turned out to be not related to QSPI flash driver.
The failure was caused by an application thread occasionally writing to QUADSPI registers because of an invalid pointer (yes, that thread has been rewritten to run in user mode now 🤦 )

Thanks anyway for looking into this, and sorry for the inconvenience.

@hjuul hjuul closed this as completed Mar 16, 2022
@erwango
Copy link
Member

erwango commented Mar 16, 2022

@hjuul Thanks for the heads up. Based on your experience in debugging this driver, is there any of the proposals made by @FRASTM that would be useful to keep for debug purpose?

@hjuul
Copy link
Contributor Author

hjuul commented Mar 17, 2022

@erwango: For debug purpose, no. This was a particularly difficult issue to resolve, so I had to instrument the driver and the HAL for visual trace anyway. And it is very easy to add those callback functions for abort and FIFO threshold when needed, so I don't think they need to be in there if they don't serve a purpose.

I do have some other notes though:

  • This driver is very limited in what it can do with regards to QSPI configuration and performance. (I see that v3.0.0 adds support for using some more IO lines in some cases, but I believe FAST READ is still not supported. Or DMA.)
  • QSPI_WaitFlagStateUntilTimeout() has a timeout of 5s, which is waaay to much, espicially for read operations. Should ideally be configurable (that's a HAL issue, but for the average Zephyr user that doesn't matter)
  • If an error is detected, hal_qspi sets hqspi->State = HAL_QSPI_STATE_ERROR. This isn't handled in any way in the Zephyr driver, so all subsequent accesses to the QSPI flash will return HAL_BUSY. Only a reboot will make the QSPI flash accessible again

@erwango
Copy link
Member

erwango commented Mar 17, 2022

@hjuul Thanks for this feedback, we'll keep that in mind for the next time we update this driver.

One note: You mention DMA as not supported, but I do think it is. You 'just' need to add proper 'dmas' property.
See here:

&quadspi {
pinctrl-0 = <&quadspi_clk_pe10 &quadspi_ncs_pe11
&quadspi_bk1_io0_pe12 &quadspi_bk1_io1_pe13
&quadspi_bk1_io2_pe14 &quadspi_bk1_io3_pe15>;
pinctrl-names = "default";
dmas = <&dma1 5 5 0x0000>;
dma-names = "tx_rx";

Or I am missing something ?

@hjuul
Copy link
Contributor Author

hjuul commented Mar 17, 2022

@erwango Looks like you're right. I didn't use DMA though.

@hjuul
Copy link
Contributor Author

hjuul commented Apr 22, 2022

Another update, because I think someone else will run into a similar problem later: The root cause of this seems to be speculative access on ARM Cortex-M7.

The Cortex-M7 may occasionally do speculative reads to any memory that is configured as Normal memory type. By default this includes the address space for the memory mapped QUADSPI.
When a speculative read is done within the address space of my QSPI flash, an actual read instruction is sent on the QSPI lines. This fills the QUADSPI read FIFO and sets the busy bit, and the QSPI driver is not able to handle that situation.

Note! This happens even when the QUADSPI is not configured in memory mapped mode!

The solution is to configure the MPU such that the entire map-able address space for the QUADSPI is set to Strongly Ordered memory. Some more information on that here:
https://drive.google.com/file/d/1g_-mDfAIYs99pRifeAvfpunHBVIdq_a8/view

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: QSPI Quad SPI bug The issue is a bug, or the PR is fixing a bug platform: STM32 ST Micro STM32 priority: low Low impact/importance bug
Projects
None yet
Development

No branches or pull requests

3 participants