subsys/mgmt/hawkbit: Unable to finish download if CPU blocking function (i.e. flash_img_buffered_write) is used
#37324
Comments
@jukkar Any comments on the |
There was a time where it downloaded 20+% of the image before a hard fault (current thread: sysworkq, not sure why tho), and doesn't reach that far again after a reset, quite random. |
To verify the write speed of the Here are the results: Write that requires a page erase: 43ms The write speed matches what I observe in the
How does writing data into external
|
@pabigot , just wondering if you have any clue about this? |
I increased the So now this becomes very weird. |
@ycsin You are right, there is a bug: the target image for upload is taken from the DTS, as the partition defined with the "image-1" label, but the write size for the device is determined for the flash selected by the chosen "zephyr,flash", which may be a different device, as probably happens here. |
@de-nordic thanks for the info, I will look more closely to it later! After this I encountered another bug where if I flash an image to the primary slot over the updated image, it will result in a hard fault inside the bootloader. I debugged and traced this just now, it seems to happen in the swap scratch function. So for some reason the bootloader is thinking that there's an upgrade and then hard fault. I've left my workplace, will update this comment later. I think there's another bug: after it booted up the updated image, which is unconfirmed, if I reset the device, the bootloader will boot right into the primary instead of doing a swap. (Likely to be another issue) EDIT: The hard fault of the bootloader happens inside this while loop. |
I encountered the issue again today even after this change and removing the flash-write part didn't help, I tried the Downloading the file using my laptop is still very fast, still not sure where the issue lies. I guess I might have to work around this using partial downloads. |
@Navin-Sankar Any comments? |
That setting the download block size to match the page size eliminates the problem suggests to me that we are running into a situation where, instead of just writing a page in pieces, we are erasing and rewriting it for each of the smaller blocks. Whether a page can be partially written does depend on the flash device in question. Most NOR flashes are able to write to parts of a page without a problem. Other devices do require the whole page to be written once between erases. If that is the case, I'm not sure there is a very good solution other than to buffer an entire page. I didn't readily find the information about the STM part in question, but most of these allow a minimum write size of 8 or 16 bytes, so it should be possible to write to different 16-byte parts of a page without needing to erase the page. |
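To make the partial-write point above concrete, here is a minimal in-RAM sketch of typical NOR behaviour: erase sets all bits to 1, and programming can only clear bits. It is not the STM32 or spi-nor driver; `struct nor_page`, `nor_erase`, `nor_program`, `write_in_blocks` and the 4096-byte page size are all hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOR_PAGE_SIZE 4096u /* assumed erase-page size, illustration only */

/* Hypothetical in-RAM model of one NOR flash page. */
struct nor_page {
    uint8_t cells[NOR_PAGE_SIZE];
};

/* Erase sets every bit to 1 (0xFF), as on typical NOR flash. */
static void nor_erase(struct nor_page *p)
{
    memset(p->cells, 0xFF, sizeof(p->cells));
}

/* Programming can only clear bits (1 -> 0); it never sets them back to 1. */
static void nor_program(struct nor_page *p, size_t off,
                        const uint8_t *data, size_t len)
{
    assert(off + len <= NOR_PAGE_SIZE);
    for (size_t i = 0; i < len; i++) {
        p->cells[off + i] &= data[i];
    }
}

/* Erase once, then fill the page in small blocks.
 * Returns 0 if the page reads back identical to src, -1 otherwise. */
static int write_in_blocks(struct nor_page *p, const uint8_t *src,
                           size_t len, size_t block)
{
    assert(len <= NOR_PAGE_SIZE && block > 0);
    nor_erase(p);
    for (size_t off = 0; off < len; off += block) {
        size_t n = (len - off < block) ? len - off : block;
        nor_program(p, off, src + off, n);
    }
    return memcmp(p->cells, src, len) == 0 ? 0 : -1;
}
```

In this model, 16-byte writes to distinct parts of an erased page read back correctly after a single erase, while programming the same byte twice (for example 0x0F then 0xF0) leaves 0x00. That is the case in which a driver has to erase again before rewriting.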
@d3zd3z after a few days of testing, I'm guessing that the cause of this issue might be networking/socket related, instead of the I do encounter a bootloader-related issue as mentioned in this comment, I think I should create a separate issue for it, after testing for a few more times to confirm the pattern. |
After further testing I found that the download is slow when something is blocking during the reception of the data, such as Increasing Will doing the |
This makes me think of something. I haven't studied the STM driver all that much, but I know that some flash drivers make the flash device itself inaccessible during the flash operation (the flash goes away while being programmed). They generally run code out of RAM to do this, and run with interrupts masked. If well written (and the hardware supports it), they can interrupt the flash operation, make the code reappear, and handle the interrupt. Again I don't know the behavior here. But, is it possible to figure out if the flash_img_buffered_write() might be blocking for a while and causing packets to be dropped? Dropped packets would slow down the transfer significantly, waiting for retransmits from the server. |
Not sure about STM32 flash either, but I'm using an external
I used wireshark to sniff the transfer on my laptop, seems like the client does not need to acknowledge every packet for the host to continue its transfer, the acknowledgement only happens every 3-4 packets. I used a uart dongle to read the TX of the modem and there's nothing much going there, since it seems like acknowledge isn't required for the transfer to continue, I'd expect it to continue output data but it isn't. But I'm not sure about the networking part. Maybe @mniestroj @rlubos @jukkar can shed some light here? |
@ycsin From the networking point of view, it could be that, due to increased CPU consumption, the application thread is not able to consume the data fast enough, so we fill up all of the RX buffers. If the low-level network driver does not specify a timeout for the allocation, or the timeout is too small, it would drop the packet, which would result in retransmission at the TCP level and decreased performance. We had a similar issue with one of the Ethernet drivers not long ago (see #36891 (comment)) - the application did not catch up with the incoming traffic, which resulted in a packet drop. Adding a timeout for the Now, I'm not very familiar with the PPP implementation, but there seems to be a similar case with the ppp driver. Probably worth checking if it's related? |
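The effect described above can be illustrated with a toy, tick-based model of an RX buffer pool. It is only a sketch under stated assumptions, not the Zephyr network stack or the PPP driver: `simulate_rx` and its parameters are hypothetical, one packet arrives per tick, the application frees one buffer every `consume_period` ticks, and the link is assumed to hold a packet while the driver waits for a buffer.

```c
#include <assert.h>

/* Toy model: returns how many of `packets` are dropped.
 * alloc_timeout is the number of ticks the driver waits for a free
 * buffer before giving up; 0 means "drop immediately if none is free". */
static unsigned simulate_rx(unsigned pool_size, unsigned packets,
                            unsigned consume_period, unsigned alloc_timeout)
{
    unsigned in_use = 0, dropped = 0, tick = 0;

    assert(pool_size > 0 && consume_period > 0);

    for (unsigned i = 0; i < packets; i++) {
        unsigned waited = 0;

        for (;;) {
            tick++;
            /* The slow application frees one buffer every consume_period ticks. */
            if (in_use > 0 && tick % consume_period == 0) {
                in_use--;
            }
            if (in_use < pool_size) {
                in_use++; /* buffer allocated, packet queued */
                break;
            }
            if (waited == alloc_timeout) {
                dropped++; /* no buffer within the timeout: drop */
                break;
            }
            waited++;
        }
    }
    return dropped;
}
```

With a pool of 4 buffers and an application that frees one buffer every 3 ticks, a zero timeout starts dropping packets as soon as the pool fills, while a timeout of 2 ticks lets every packet through in this model. In a real system the waiting time is bounded by how much the modem or link can buffer, which the model ignores.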
Thanks heaps for the pointer. I'm not familiar with PPP nor networking, but this does sound like something I'm facing, I will try to test that in the ppp driver. |
I think there are a few ways that can solve this issue, from the networking perspective:
and from the application perspective, in this case the hawkbit/stream_flash:
I will try the PPP fix as soon as I can and do a PR if it does improve/solve the issue. For the hawkbit part, I guess I will make a PR to erase the slot-1 prior to download at some point. But all of these seem to be workarounds; I think the real issue is that both PPP and the hawkbit/application are using too much CPU time for the entire operation. I wonder if the data transmission from modem to memory in PPP and the data transmission from memory to SPI flash can be offloaded to DMA? @jukkar @mniestroj any idea for the PPP part? |
Maybe worth to consider moving calls to |
@nvlsianpu is it better to offload this in |
@rlubos some update on this, I've moved to a new version of my custom board and that uses AT45 flash with a page of 512B, the previous one (AT25) was 4KiB, this makes the format to happen quite frequently and caused the upgrade to fail, even with the ppp delay at 200 msec. What is the trade off if I continue to increase this delay? I tried to enable SPI DMA and sometimes I get this error. |
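As a rough back-of-the-envelope check of why the smaller page matters, the sketch below counts how many pages an image spans. `pages_needed` is a hypothetical helper, and it assumes one erase per page and that the 251 kB image size from the issue description means 251 * 1024 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages (and, under the one-erase-per-page assumption,
 * erase operations) needed to store image_bytes. Rounds up. */
static size_t pages_needed(size_t image_bytes, size_t page_size)
{
    assert(page_size > 0);
    return (image_bytes + page_size - 1) / page_size;
}
```

Under these assumptions `pages_needed(251 * 1024, 512)` is 502 while `pages_needed(251 * 1024, 4096)` is 63, so the 512 B part triggers roughly eight times as many erase operations during the same download.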
Some more updates: I think the CPU is simply too occupied by the networking stuff and UART interrupts due to firmware transfer: at least 7kiB per second, up to ~15 KiB per second from my observation. This is when I comment out the |
I think it depends on the driver/modem really: if packets keep coming and they are not consumed due to a lack of RX buffers, the modem will eventually drop them. It could be the case that the respective driver/modem has buffering capabilities of its own, but they're not unlimited. So if we're not talking about sporadic CPU busyness causing delay, but a general CPU overload, we should rather think of limiting the data flow at the TCP level. Typically, you can limit the amount of data sent by the server by decreasing the Receive Window size sent in the ACK message. Unfortunately, from what I've seen, the TCP2 implementation does not implement the feature, and sends a fixed value of IPv6 MTU all the time, effectively giving no limits to the server. I suggest opening an enhancement issue for the feature, as IMO it's quite valuable for constrained devices, and it seems to be a regression compared to the previous TCP implementation (where Receive Window handling was implemented). |
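A minimal sketch of the receive-window idea mentioned above: the receiver advertises only as many bytes as it can currently buffer, so a slow consumer throttles the sender. This is not the Zephyr TCP2 code; `advertised_window` and its parameters are hypothetical, and the sketch ignores the window scale option and silly-window avoidance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the 16-bit window value to place in an outgoing ACK.
 * rx_capacity: total bytes the receive side can buffer.
 * rx_queued:   bytes received but not yet consumed by the application. */
static uint16_t advertised_window(size_t rx_capacity, size_t rx_queued)
{
    assert(rx_queued <= rx_capacity);

    size_t free_bytes = rx_capacity - rx_queued;

    /* The TCP header window field is 16 bits wide without window scaling. */
    return (free_bytes > UINT16_MAX) ? (uint16_t)UINT16_MAX
                                     : (uint16_t)free_bytes;
}
```

When the application falls behind, `rx_queued` grows and the advertised window shrinks towards zero, which tells the server to pause instead of sending data that would be dropped and retransmitted.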
Is this feature limited to IPv6 only? or including IPv4? |
Well, I don't want to say for the author, but it seems that IPv6 MTU was just an arbitrary value chosen for the Window Size. It should not matter whether IPv4 or IPv6 is used at the IP layer. |
I worked around this by enabling SPI async. I wonder if #39275 would be able to fix this for SoCs that don't support SPI async, by limiting the download speed to what the CPU can gracefully process? |
After testing that PR a bit it turned out that recv window handling will require a bit more effort to implement than proposed in the PR. So this is still an open topic. |
I have tried to use hawkBit on a new board that uses the It got stuck at 66% by default, if I increase the I think the @rlubos @mniestroj any idea? EDIT:
|
Correct. TCP stack is implemented on ESP chip. @ycsin Do you use UART hardware flow control? |
@mniestroj thanks for the reply.
Are you aware of anything that can be configured which might be able to workaround this issue? EDIT: |
@ycsin Having HW flow control is quite critical, as it will suspend sending/receiving data stream when the other end is busy (e.g. processing previous packet). Without that, part of the frame can be easily dropped and there is no easy way to recover from that. |
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time. |
Update 2
The upgrade works consistently after I enabled the SPI async API, as that offloads the writing of the firmware from the CPU to the DMA, freeing the CPU to handle the TCP packets.
Update 1
CPU blocking functions such as flash_img_buffered_write() or log immediate mode can cause this issue. Happens to the big_http_download sample as well.
Describe the bug
I'm currently trying the hawkbit sample for OTA on my custom board. For some reason the download process is extremely long and the gap between incoming data is increasingly long. The first handful of chunks arrive normally; after that the next chunk arrives ~30 s later, then ~1 min later, then almost exactly 2 min later (like there's a pattern), and eventually the download fails after about 20 kB (it usually fails at exactly 20 kB) or after ~17 minutes.
After trying a few things, I found that the download is slow only if it tries to write the downloaded buffer into the flash here. If I comment that line out, the download is actually pretty fast and typically manages to download everything (251 kB) in under 40 seconds. If I replace that line with a k_msleep(60), it also downloads just as fast.
To determine the time flash_img_buffered_write() took to finish, I calculated the elapsed time using k_uptime_get() before and after the function call, and the log system tells me that it took 6-41 ms. And now I'm pretty much lost and not sure how to debug this and get it working.
From my current understanding, hawkbit will do an HTTP GET request to the download link using http_client_req, then the connection/socket should be opened and the image chunks will be pushed continuously from the server without requiring an ack from the client (I could be wrong here). The hawkbit client will simply write the received data into the flash until HTTP_DATA_FINAL is received. I don't know how a function that literally returned immediately can affect this download process.
The setup that I use:
gsm_ppp driver)
image-0 partition is 416 kB in the internal flash
image-1 partition is 416 kB in the external spi-nor flash
To Reproduce
Steps to reproduce the behavior:
hawkbit sample
Expected behavior
I expect the download to finish under 40 seconds.
Impact
Unable to use hawkbit
Logs and console output
hawkbit_log.txt
Environment (please complete the following information):
Extra context
This is the log file when I replace flash_img_buffered_write() with k_msleep(60): hawkbit_test_log.txt