net_buf reference count not protected #32564
Comments
This change would be welcome, and it most probably fixes your issue. Could you create a PR for this change?
OK. As a newbie, I am reading the Contribution Guidelines first and will then create a PR.
I think the main reason it's a uint8_t is that it creates a smaller struct size. We should consider solving the same by adding irq_lock/unlock or similar to the appropriate places in the implementation.
@jhedberg we would need a spinlock in order to support SMP: either a global one for all net_bufs or one per net_buf, which would consume more space than an atomic. Just to confirm, was net_buf supposed to be thread-safe? There are not so many places where I am uncertain that using atomic reference counting is actually solving described bug. TCP application should operate on
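For comparison, the lock-based alternative discussed above might look roughly like the following sketch. It is not Zephyr code: `demo_small_buf`, `demo_ref_lock`, and the helper functions are hypothetical names, and a C11 `atomic_flag` stands in for a kernel spinlock so that the example builds outside Zephyr. It keeps the one-byte counter and serializes every update through one global lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net_buf: the counter stays one byte wide. */
struct demo_small_buf {
    uint8_t ref;
};

/* One global lock shared by all buffers (assumption: stands in for a kernel spinlock). */
static atomic_flag demo_ref_lock = ATOMIC_FLAG_INIT;

static void demo_lock(void)
{
    /* Spin until the flag was previously clear, i.e. we acquired the lock. */
    while (atomic_flag_test_and_set_explicit(&demo_ref_lock, memory_order_acquire)) {
    }
}

static void demo_unlock(void)
{
    atomic_flag_clear_explicit(&demo_ref_lock, memory_order_release);
}

/* Increment under the lock; returns the new count. */
static uint8_t demo_small_ref(struct demo_small_buf *buf)
{
    demo_lock();
    uint8_t now = ++buf->ref;
    demo_unlock();
    return now;
}

/* Decrement under the lock; returns the new count (0 means "free the buffer"). */
static uint8_t demo_small_unref(struct demo_small_buf *buf)
{
    demo_lock();
    assert(buf->ref > 0);
    uint8_t now = --buf->ref;
    demo_unlock();
    return now;
}
```

The trade-off is the one raised in the thread: a global lock keeps each buffer small but makes all buffers contend on the same lock, while a per-buffer lock would cost more space than an atomic counter.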
@mniestroj ok, in that case we're probably better off with moving to atomic_t. I think we should at the same time consider changing the behaviour of zephyr/subsys/bluetooth/host/conn.c Lines 1864 to 1899 in b182ec7
We could even consider creating a
I am not sure this brings much benefit to
Have you tried just making this a uint32_t and testing that? That might show whether it's a memory issue or an atomicity issue.
Changing the type of the net_buf ref from uint8_t to atomic_t did not, in the end, prevent my system from crashing, so I did not try uint32_t. The Zephyr master branch fixes some race conditions after the v2.5 release, so I switched to the master branch code. The system crashed with an ARM hard fault as follows:
The faulting instruction address points to a non-text area, which suggests a stack problem, but I have set all threads' stack sizes to 8192 and also enabled CONFIG_HW_STACK_PROTECTION=y. It is hard to see what happened before this ARM hard fault. I am now using the ARM MPU (Memory Protection Unit) to mark the non-text area as non-executable, so that I can find the point at which the PC register changes to an illegal address.
It seems more protection is also needed in the kernel. I tested dde03c6 (without the modification to protect net_buf) and an MPU fault occurred:
To catch NULL pointer accesses, I used the MPU to mark ARM address 0x0 (located in ITCM, which is not used in my program) as no-access, so that a NULL pointer access to address 0 causes an MPU fault. The faulting instruction address 0x8001b2bc is located in sys_dlist_remove(), which is called by unpend_thread_no_timeout(). The NULL pointer in sys_dlist_remove() is 'prev', which is compiled into r2 (its value is 0 at the time of the MPU fault). The comment on sys_dlist_remove() says: "This and other sys_dlist_*() functions are not thread safe."
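One way a NULL 'prev' can show up in a list removal is when the same node is unlinked twice, for example by two contexts racing on the same list. The sketch below only illustrates that mechanism and is not a diagnosis of this fault. It assumes a circular doubly linked list whose remove operation clears the node's links afterwards; `demo_dnode` and the `demo_dlist_*` helpers are hypothetical names, not the sys_dlist API. The checked remove reports the second removal instead of dereferencing NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type for a circular doubly linked list. */
struct demo_dnode {
    struct demo_dnode *next;
    struct demo_dnode *prev;
};

/* An empty list points at itself in both directions. */
static void demo_dlist_init(struct demo_dnode *list)
{
    list->next = list;
    list->prev = list;
}

/* Link node in front of the list head, i.e. at the tail. */
static void demo_dlist_append(struct demo_dnode *list, struct demo_dnode *node)
{
    struct demo_dnode *tail = list->prev;

    node->next = list;
    node->prev = tail;
    tail->next = node;
    list->prev = node;
}

/*
 * Unlink node and clear its links. Returns false if the node was already
 * unlinked, which is the case where an unchecked remove would write
 * through a NULL 'prev'.
 */
static bool demo_dlist_remove_checked(struct demo_dnode *node)
{
    struct demo_dnode *prev = node->prev;
    struct demo_dnode *next = node->next;

    if (prev == NULL || next == NULL) {
        return false;
    }
    prev->next = next;
    next->prev = prev;
    node->next = NULL;
    node->prev = NULL;
    return true;
}
```

A check like this only detects the double removal after the fact; the callers still need a lock around the list to prevent it, as the quoted comment implies.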
@chyu313 can you confirm this is still an issue in current main?
Yes, it is still an issue. I keep pulling the Zephyr code and testing it. The fault still occurs.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
I created a PR that implements k_ref: #64798
Describe the bug
![image](https://user-images.githubusercontent.com/56861030/108799430-44eb1500-75cb-11eb-9236-9dcca695fac4.png)
I tested a heavy TCP traffic load, with 32 connections sending data to 32 TCP echo server threads on an MIMXRT1060_EVK board, while at the same time using 'ping -f' to keep Zephyr busy handling ICMP packets. After one day, the system crashed as follows:
I ran the test many times, and on one occasion the system halted due to an assertion in k_spin_unlock() saying "Not my spinlock %p", in which the spinlock pointer pointed into _net_buf_pool_area. I didn't capture the error message at that time, but I tried using atomic operations to protect the reference count of net_buf, following 5ef825f committed by @jukkar. I changed 'uint8_t ref' in 'struct net_buf' to 'atomic_t atomic_ref', and simply replaced 'buf->ref++' with atomic_inc(&buf->atomic_ref), 'buf->ref--' with atomic_dec(&buf->atomic_ref), and the code reading the reference count with atomic_get(&frag->atomic_ref).
I don't know why this modification solves my problem, but since protecting the net_buf reference count with atomic operations, my test has run for 3 days and is still running.
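The substitution described above can be sketched outside Zephyr as follows. This is an illustration under stated assumptions, not the actual patch: `demo_buf` and its helpers are hypothetical, and C11 `atomic_fetch_add`/`atomic_fetch_sub` stand in for the `atomic_inc()`/`atomic_dec()` calls named in the report so that the example builds on a host compiler.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct net_buf; only the counter is modelled. */
struct demo_buf {
    atomic_int atomic_ref; /* was: uint8_t ref */
    int freed;             /* set when the last reference is dropped */
};

static void demo_buf_init(struct demo_buf *buf)
{
    atomic_init(&buf->atomic_ref, 1);
    buf->freed = 0;
}

/* Replaces `buf->ref++`: a single atomic read-modify-write. */
static struct demo_buf *demo_buf_ref(struct demo_buf *buf)
{
    atomic_fetch_add(&buf->atomic_ref, 1);
    return buf;
}

/*
 * Replaces `buf->ref--`: the previous value is returned atomically, so
 * only the caller that observes the 1 -> 0 transition releases the buffer.
 */
static void demo_buf_unref(struct demo_buf *buf)
{
    int old = atomic_fetch_sub(&buf->atomic_ref, 1);

    assert(old > 0); /* an unref without a matching ref is a bug */
    if (old == 1) {
        buf->freed = 1;
    }
}
```

With a plain `buf->ref--`, two contexts can both read the same value and both write back the same result, so a reference is lost or the buffer is released twice. Deciding on the value returned by the atomic decrement avoids a separate, racy read of the counter.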
Environment (please complete the following information):