Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

event_loop: Check loop condition before removing event from bucket queue #5649

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zhenyami
Copy link
Contributor

@zhenyami zhenyami commented Jun 27, 2022

Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS credentials provider refresh function, if the control is not returned to the function that sets a lock before a network call.

This is caused by the priority event loop handling of network timeout events. That kind of event is injected into the loop and then added to the bucket queue, but only once. If the event is not handled after being removed from the bucket queue, such event is not re-added to the queue, and not processed on loop re-entry. This situation happens because the event is first removed from the bucket queue, and then the loop condition is checked.

  • Changed the order of statements in priority queue loop macro: we now check if the loop has reached the iteration limit before removing the event from the bucket queue.
  • Added a test for event priority loop. It validates that when the event loop finishes all iterations, and the injected event is next in the priority queue, then this event will be processed next time the priority event loop runs.

Signed-off-by: zhenyami [email protected]

Fixes #5553

Related changes

Testing
Before we can approve your change; please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper
Copy link
Member

edsiper commented Jun 27, 2022

cc: @matthewfala

Copy link
Contributor

@matthewfala matthewfala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! The event loop macro change looks good. Please see my comment.

I have not reviewed any of the other files, only the file containing the event loop macro.

@edsiper Currently on an education leave and returning mid August. If more review is needed I may be able to put some time into it if @lubingfeng permits.

Comment on lines 82 to 87
(__flb_event_priority_live_foreach_iter < max_iter || max_iter == -1) && \
(NULL != ( \
event = flb_bucket_queue_find_min(bktq) \
? mk_list_entry(flb_bucket_queue_pop_min(bktq), struct mk_event, _priority_head) \
: NULL \
)); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Good find. This looks better that what we had before.

To summarize the change:

  • We don't remove the next event from the bucket queue if max-iter is reached.

Regardless, it's interesting how you are seeing lost events. If the event is not processed, I would think it would be re-added to the bucket queue on the next pass because the event is technically still triggered/in the ready list.

But then again, if it keeps happening due to a situation where max_iter events are repeatedly queued before the problematic event, it might keep being dropped and picked up by the loop. Or it could be that event is some kind of one-shot event, which I don't think fluent bit uses currently.

I think your change should be accepted as long as it has been tested thoroughly!

Thanks for your work!

A small style note. There's some fluent bit max line columns limit in the style guide, I forget what it is, but I think the changes may have gone over the style limits. Maybe check to see if the lines are less than whatever that column limit is.

Copy link
Contributor Author

@zhenyami zhenyami Jun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthewfala Thank you for taking the time to check this.

Regardless, it's interesting how you are seeing lost events. If the event is not processed, I would think it would be re-added to the bucket queue on the next pass because the event is technically still triggered/in the ready list.

But then again, if it keeps happening due to a situation where max_iter events are repeatedly queued before the problematic event, it might keep being dropped and picked up by the loop. Or it could be that event is some kind of one-shot event, which I don't think fluent bit uses currently.

That's the thing – I don't think the event is re-added to the bucket queue on the next pass. There are two ways an event is added to the bucket queue:

  • flb_event_load_bucket_queue – doesn't work for the timed out event, as its status is MK_EVENT_NONE
  • flb_event_load_injected_events – this only adds the injected event once, when the injected event is new in the event loop, and event loop actual count > __flb_event_priority_live_foreach_n_events counter variable; this variable is then updated and this injected event is skipped next time

I think your change should be accepted as long as it has been tested thoroughly!

I've been testing this at my job. I think it would be nice to have a test for this, so I'll try to add one.

A small style note. There's some fluent bit max line columns limit in the style guide, I forget what it is, but I think the changes may have gone over the style limits. Maybe check to see if the lines are less than whatever that column limit is.

Thanks for the note. I will check the style guide, and ask Eduardo and team to review.

@edsiper
Copy link
Member

edsiper commented Jul 11, 2022

is this ready to go ?

@zhenyami
Copy link
Contributor Author

zhenyami commented Jul 13, 2022

Additional changes

I reformatted code to be below 90 characters line length limit.
I also added a test to validate injected event handling in priority event loop.

This test fails without the fix to the event loop order:

$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms...              [ OK ]
Test test_non_blocking_and_blocking_timeout...  [ OK ]
Test test_infinite_wait...                      [ OK ]
Test event_loop_stress_priority_add_delete...   [ OK ]
Test test_inject_event_priority_loop...         [ FAILED ]
  flb_event_loop.c:633: Check event->priority == 1... failed
    Expected injected event with priority 1, instead got event with priority 2
FAILED: 1 of 5 unit tests has failed.

Test passes with the fix applied:

$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms...              [ OK ]
Test test_non_blocking_and_blocking_timeout...  [ OK ]
Test test_infinite_wait...                      [ OK ]
Test event_loop_stress_priority_add_delete...   [ OK ]
Test test_inject_event_priority_loop...         [ OK ]
SUCCESS: All unit tests have passed.

@zhenyami zhenyami force-pushed the event-loop-priority-macro-fix-order branch from 9968c5b to cb81209 Compare July 13, 2022 21:22
@zhenyami
Copy link
Contributor Author

zhenyami commented Jul 13, 2022

Valgrind output

I ran Valgrind with the new test.

$ valgrind ./bin/flb-it-flb_event_loop
...
Test test_inject_event_priority_loop...         
==2428== Conditional jump or move depends on uninitialised value(s)
==2428==    at 0x40B2CF: flb_event_load_bucket_queue_event (flb_event_loop.h:31)
==2428==    by 0x40B458: flb_event_load_injected_events (flb_event_loop.h:67)
==2428==    by 0x40D2AC: test_inject_event_priority_loop (flb_event_loop.c:614)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Conditional jump or move depends on uninitialised value(s)
==2428==    at 0x40F25D: flb_pipe_close (flb_pipe.c:103)
==2428==    by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428==    by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Syscall param close(fd) contains uninitialised byte(s)
==2428==    at 0x4E4278D: ??? (in /lib64/libpthread-2.17.so)
==2428==    by 0x40F26F: flb_pipe_close (flb_pipe.c:107)
==2428==    by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428==    by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428==
==2428== HEAP SUMMARY:
==2428==     in use at exit: 3,728 bytes in 7 blocks
==2428==   total heap usage: 23 allocs, 16 frees, 10,608 bytes allocated
==2428==
==2428== LEAK SUMMARY:
==2428==    definitely lost: 0 bytes in 0 blocks
==2428==    indirectly lost: 0 bytes in 0 blocks
==2428==      possibly lost: 3,648 bytes in 6 blocks
==2428==    still reachable: 80 bytes in 1 blocks
==2428==         suppressed: 0 bytes in 0 blocks
==2428== Rerun with --leak-check=full to see details of leaked memory

@zhenyami zhenyami force-pushed the event-loop-priority-macro-fix-order branch 5 times, most recently from d70b543 to e0f800b Compare July 14, 2022 06:02
@zhenyami zhenyami marked this pull request as ready for review July 15, 2022 19:21
@zhenyami zhenyami force-pushed the event-loop-priority-macro-fix-order branch 2 times, most recently from 7339ba7 to d5cf8e9 Compare July 15, 2022 23:29
@zhenyami zhenyami marked this pull request as draft July 15, 2022 23:29
@zhenyami zhenyami marked this pull request as ready for review July 18, 2022 15:53
@zhenyami
Copy link
Contributor Author

Ready for review.
I don't know what to do about the MacOS test failure. I see some build issues are being worked on in other PRs.

Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS provider credentials refresh function, if the control is not returned to the function that sets a lock before a network call.

This is caused by the priority event loop handling of network timeout events. That kind of event is injected into the loop and then added to the bucket queue, but only once. If the event is not handled after being removed from the bucket queue, such event is not re-added to the queue, and not processed on loop re-entry. This situation happens because the event is first removed, and then the loop condition is checked.

* Changed the order of statements in priority queue loop macro: we now check if the loop has reached the iteration limit before removing the event from the bucket queue.
* Added a test for event priority loop. It validates that when the event loop finishes all iterations, and the injected event is next in the priority queue, then this event will be processed next time the priority event loop runs.

Signed-off-by: zhenyami <[email protected]>
@zhenyami zhenyami force-pushed the event-loop-priority-macro-fix-order branch from d5cf8e9 to 280aeca Compare July 25, 2022 20:14
@matthewfala
Copy link
Contributor

@zhenyami, this change is significant! It's clear you put effort into understanding the event loop system and making appropriate changes.

@edsiper It seems that this is a legitimate problem with the priority event loop that will cause injected events to be dropped in edge cases. Would it be possible prioritizing merging this into Fluent Bit?

The test case looks good.

  • 3 priority 0 events are added, with 1 priority 1 event injected {0,0,0,1}
  • A loop of max 3 iterations loops over the 3 priority 0 events {0,0,0} (potentially drops event 3+1=4, the injected event)
  • 3 priority 2 events are added {1,2,2,2}
  • A loop of max 4 iterations loops over the events {1,2,2,2} checking priorities are handled in the correct order.

@matthewfala
Copy link
Contributor

@leonardo-albertovich Is there any progress on reviewing this PR. Seems like an important fix to a bug in the event loop. Not sure how widespread issues that may arise might be, but I would think at some point people will complain about fluent bit freezing up and that would be linked back to this PR's highlighted issue.

@leonardo-albertovich
Copy link
Collaborator

I haven't but if it others have that's OK by me.

@edsiper
Copy link
Member

edsiper commented Sep 9, 2022

folks, do we have 100% confidence on this PR and tests done ?

I see the unit test here:

https://github.com/fluent/fluent-bit/pull/5649/files#diff-aa41e6d7b7569a427563d189aa551f0563a8d2b3e3f25ea48b556150f308a785R582

but I want to make sure the unit test reproduces the original issue

@lubingfeng
Copy link

@matthewfala and @PettitWesley can you confirm if all required tests are done and we have full confidence on this fix.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2022

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Dec 9, 2022
@yackushevas
Copy link

yackushevas commented Dec 16, 2022

up

@github-actions github-actions bot removed the Stale label Dec 17, 2022
@github-actions
Copy link
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Mar 17, 2023
@leonardo-albertovich
Copy link
Collaborator

@matthewfala do you think this is still relevant? Do you think we could wrap it up?

@github-actions github-actions bot removed the Stale label Mar 18, 2023
Comment on lines +82 to +89
(__flb_event_priority_live_foreach_iter < max_iter || max_iter == -1) \
&& (NULL != ( \
event = flb_bucket_queue_find_min(bktq) \
? mk_list_entry(flb_bucket_queue_pop_min(bktq), \
struct mk_event, \
_priority_head) \
: NULL \
)); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can refactor this later, but it think this section would be more readable if we had it as:

        (__flb_event_priority_live_foreach_iter < max_iter || max_iter == -1)           \
        && !flb_bucket_queue_is_empty(bktq)                                             \
        && (event = mk_list_entry(flb_bucket_queue_pop_min(bktq),                       \
                                    struct mk_event,                                    \
                                    _priority_head)) 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this, why would we want to get the event in the check condition for the for loop??

Also, the new condition is added with logical AND &&, then won't this mean the loop will still terminate when the first __flb_event_priority_live_foreach_iter < max_iter returns false, and the statement after the AND won't run. By which I mean- I don't understand the difference between this and making the new bits be the first statements in the loop. (As in not part of the for condition definition).

??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's how the for loop parts are run:

  1. Init
  2. Condition
  3. Code
  4. Update
  5. Condition

If you take an event out in the update and the condition aborts, then you risk loosing one event

  1. Update - take event out of bucket queue
  2. Condition - potentially abort, which gets rid of the event.

With this code change:
We don't remove the next event from the bucket queue if max-iter is reached.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think I might get it now...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthewfala
Copy link
Contributor

@leonardo-albertovich, A while back, I looked into this deeply and believe that the tests are sufficient to cause the bug and show that the solution works.

I'm confident in the presented solution. However in the future we can refactor with the suggested change to make it slightly more readable.

@PettitWesley
Copy link
Contributor

@matthewfala

Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS credentials provider refresh function, if the control is not returned to the function that sets a lock before a network call.

The issue description kind of sounds like the mk_event_inject bug that we found right? #6822

@leonardo-albertovich
Copy link
Collaborator

I'm out of the loop so I trust your judgment @matthewfala.

As for the bug mentioned by @PettitWesley, I don't think it's related, of course, the priority event loop is the root of both but that's as far as it goes from what I see. I guess there's a lesson to be learned in it.

@matthewfala
Copy link
Contributor

@PettitWesley I believe these bugs are not related. I think this code is ready to be merged.

@github-actions
Copy link
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Copy link
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Nov 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aws imds: coroutine not resumed on connection timeout, locking output
7 participants