event_loop: Check loop condition before removing event from bucket queue #5649

zhenyami · 2022-06-27T19:34:08Z

Fluent Bit sometimes doesn't resume coroutines for timed-out network events. This can also cause deadlocks in some cases, such as in the AWS credentials provider refresh function, when control is not returned to the function that sets a lock before a network call.

This is caused by how the priority event loop handles network timeout events. Such an event is injected into the loop and then added to the bucket queue, but only once. If the event is not handled after being removed from the bucket queue, it is not re-added to the queue and is not processed on loop re-entry. This happens because the event is first removed from the bucket queue, and only then is the loop condition checked.

* Changed the order of statements in the priority queue loop macro: we now check whether the loop has reached the iteration limit before removing the event from the bucket queue.
* Added a test for the event priority loop. It validates that when the event loop finishes all iterations and the injected event is next in the priority queue, this event is processed the next time the priority event loop runs.

Signed-off-by: zhenyami [email protected]

Fixes #5553

Related changes

Testing
Before we can approve your change; please submit the following in a comment:

[N/A] Example configuration file for the change
Debug log output from testing the change

Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

[N/A] Attached local packaging test output showing all targets (including any new ones) build.

Documentation

[N/A] Documentation required for this feature

Backporting

Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

edsiper · 2022-06-27T21:56:50Z

cc: @matthewfala

matthewfala

Hi! The event loop macro change looks good. Please see my comment.

I have not reviewed any of the other files, only the file containing the event loop macro.

@edsiper I'm currently on education leave and returning mid-August. If more review is needed, I may be able to put some time into it if @lubingfeng permits.

matthewfala · 2022-06-27T22:54:05Z

include/fluent-bit/flb_event_loop.h

+        (__flb_event_priority_live_foreach_iter < max_iter || max_iter == -1) &&                      \
+        (NULL != (                                                                                    \
+            event = flb_bucket_queue_find_min(bktq)                                                   \
+                    ? mk_list_entry(flb_bucket_queue_pop_min(bktq), struct mk_event, _priority_head)  \
+                    : NULL                                                                            \
+        ));                                                                                           \


Great! Good find. This looks better than what we had before.

To summarize the change:

We don't remove the next event from the bucket queue if max-iter is reached.

Regardless, it's interesting how you are seeing lost events. If the event is not processed, I would think it would be re-added to the bucket queue on the next pass because the event is technically still triggered/in the ready list.

But then again, if it keeps happening due to a situation where max_iter events are repeatedly queued before the problematic event, it might keep being dropped and picked up by the loop. Or it could be that event is some kind of one-shot event, which I don't think fluent bit uses currently.

I think your change should be accepted as long as it has been tested thoroughly!

Thanks for your work!

A small style note. There's some fluent bit max line columns limit in the style guide, I forget what it is, but I think the changes may have gone over the style limits. Maybe check to see if the lines are less than whatever that column limit is.

zhenyami · 2022-06-28

@matthewfala Thank you for taking the time to check this.

> Regardless, it's interesting how you are seeing lost events. If the event is not processed, I would think it would be re-added to the bucket queue on the next pass because the event is technically still triggered/in the ready list.
>
> But then again, if it keeps happening due to a situation where max_iter events are repeatedly queued before the problematic event, it might keep being dropped and picked up by the loop. Or it could be that event is some kind of one-shot event, which I don't think fluent bit uses currently.

That's the thing – I don't think the event is re-added to the bucket queue on the next pass. There are two ways an event is added to the bucket queue:

* flb_event_load_bucket_queue: doesn't work for the timed-out event, as its status is MK_EVENT_NONE
* flb_event_load_injected_events: adds the injected event only once, when the injected event is new in the event loop and the event loop's actual count is greater than the __flb_event_priority_live_foreach_n_events counter variable. That variable is then updated, so the injected event is skipped next time

> I think your change should be accepted as long as it has been tested thoroughly!

I've been testing this at my job. I think it would be nice to have a test for this, so I'll try to add one.

> A small style note. There's some fluent bit max line columns limit in the style guide, I forget what it is, but I think the changes may have gone over the style limits. Maybe check to see if the lines are less than whatever that column limit is.

Thanks for the note. I will check the style guide, and ask Eduardo and team to review.

edsiper · 2022-07-11T23:34:15Z

is this ready to go ?

zhenyami · 2022-07-13T20:46:02Z

Additional changes

I reformatted the code to stay below the 90-character line length limit.
I also added a test to validate injected event handling in the priority event loop.

This test fails without the fix to the event loop order:

$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms...              [ OK ]
Test test_non_blocking_and_blocking_timeout...  [ OK ]
Test test_infinite_wait...                      [ OK ]
Test event_loop_stress_priority_add_delete...   [ OK ]
Test test_inject_event_priority_loop...         [ FAILED ]
  flb_event_loop.c:633: Check event->priority == 1... failed
    Expected injected event with priority 1, instead got event with priority 2
FAILED: 1 of 5 unit tests has failed.

Test passes with the fix applied:

$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms...              [ OK ]
Test test_non_blocking_and_blocking_timeout...  [ OK ]
Test test_infinite_wait...                      [ OK ]
Test event_loop_stress_priority_add_delete...   [ OK ]
Test test_inject_event_priority_loop...         [ OK ]
SUCCESS: All unit tests have passed.

zhenyami · 2022-07-13T21:25:27Z

Valgrind output

I ran Valgrind with the new test.

$ valgrind ./bin/flb-it-flb_event_loop
...
Test test_inject_event_priority_loop...         
==2428== Conditional jump or move depends on uninitialised value(s)
==2428==    at 0x40B2CF: flb_event_load_bucket_queue_event (flb_event_loop.h:31)
==2428==    by 0x40B458: flb_event_load_injected_events (flb_event_loop.h:67)
==2428==    by 0x40D2AC: test_inject_event_priority_loop (flb_event_loop.c:614)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Conditional jump or move depends on uninitialised value(s)
==2428==    at 0x40F25D: flb_pipe_close (flb_pipe.c:103)
==2428==    by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428==    by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Syscall param close(fd) contains uninitialised byte(s)
==2428==    at 0x4E4278D: ??? (in /lib64/libpthread-2.17.so)
==2428==    by 0x40F26F: flb_pipe_close (flb_pipe.c:107)
==2428==    by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428==    by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428==
==2428== HEAP SUMMARY:
==2428==     in use at exit: 3,728 bytes in 7 blocks
==2428==   total heap usage: 23 allocs, 16 frees, 10,608 bytes allocated
==2428==
==2428== LEAK SUMMARY:
==2428==    definitely lost: 0 bytes in 0 blocks
==2428==    indirectly lost: 0 bytes in 0 blocks
==2428==      possibly lost: 3,648 bytes in 6 blocks
==2428==    still reachable: 80 bytes in 1 blocks
==2428==         suppressed: 0 bytes in 0 blocks
==2428== Rerun with --leak-check=full to see details of leaked memory

zhenyami · 2022-07-18T15:55:06Z

Ready for review.
I don't know what to do about the MacOS test failure. I see some build issues are being worked on in other PRs.

Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS provider credentials refresh function, if the control is not returned to the function that sets a lock before a network call.

This is caused by the priority event loop handling of network timeout events. That kind of event is injected into the loop and then added to the bucket queue, but only once. If the event is not handled after being removed from the bucket queue, such event is not re-added to the queue, and not processed on loop re-entry. This situation happens because the event is first removed, and then the loop condition is checked.

* Changed the order of statements in priority queue loop macro: we now check if the loop has reached the iteration limit before removing the event from the bucket queue.
* Added a test for event priority loop. It validates that when the event loop finishes all iterations, and the injected event is next in the priority queue, then this event will be processed next time the priority event loop runs.

Signed-off-by: zhenyami <[email protected]>

matthewfala · 2022-08-15T17:45:44Z

@zhenyami, this change is significant! It's clear you put effort into understanding the event loop system and making appropriate changes.

@edsiper It seems that this is a legitimate problem with the priority event loop that will cause injected events to be dropped in edge cases. Would it be possible to prioritize merging this into Fluent Bit?

The test case looks good.

* 3 priority 0 events are added, with 1 priority 1 event injected {0,0,0,1}
* A loop of max 3 iterations loops over the 3 priority 0 events {0,0,0} (potentially drops event 3+1=4, the injected event)
* 3 priority 2 events are added {1,2,2,2}
* A loop of max 4 iterations loops over the events {1,2,2,2}, checking that priorities are handled in the correct order.

matthewfala · 2022-09-06T22:47:36Z

@leonardo-albertovich Is there any progress on reviewing this PR? It seems like an important fix to a bug in the event loop. I'm not sure how widespread the resulting issues might be, but I would think that at some point people will complain about Fluent Bit freezing up, and that would be linked back to the issue this PR highlights.

leonardo-albertovich · 2022-09-07T08:07:55Z

I haven't, but if others have, that's OK by me.

edsiper · 2022-09-09T04:49:16Z

folks, do we have 100% confidence on this PR and tests done ?

I see the unit test here:

https://github.com/fluent/fluent-bit/pull/5649/files#diff-aa41e6d7b7569a427563d189aa551f0563a8d2b3e3f25ea48b556150f308a785R582

but I want to make sure the unit test reproduces the original issue

lubingfeng · 2022-09-09T06:05:54Z

@matthewfala and @PettitWesley can you confirm whether all required tests are done and we have full confidence in this fix?

github-actions · 2022-12-09T02:04:10Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

yackushevas · 2022-12-16T14:55:41Z

up

github-actions · 2023-03-17T01:59:50Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

leonardo-albertovich · 2023-03-17T17:01:18Z

@matthewfala do you think this is still relevant? Do you think we could wrap it up?

matthewfala · 2023-03-29T23:35:52Z

include/fluent-bit/flb_event_loop.h

+        (__flb_event_priority_live_foreach_iter < max_iter || max_iter == -1)           \
+        && (NULL != (                                                                   \
+            event = flb_bucket_queue_find_min(bktq)                                     \
+                    ? mk_list_entry(flb_bucket_queue_pop_min(bktq),                     \
+                                    struct mk_event,                                    \
+                                    _priority_head)                                     \
+                    : NULL                                                              \
+        ));                                                                             \


We can refactor this later, but I think this section would be more readable if we had it as:

        (__flb_event_priority_live_foreach_iter < max_iter || max_iter == -1) \
        && !flb_bucket_queue_is_empty(bktq)                                   \
        && (event = mk_list_entry(flb_bucket_queue_pop_min(bktq),             \
                                  struct mk_event,                            \
                                  _priority_head))

PettitWesley · 2023-03-30

I don't understand this. Why would we want to get the event in the check condition of the for loop??

Also, the new condition is added with a logical AND (&&), so won't the loop still terminate as soon as __flb_event_priority_live_foreach_iter < max_iter returns false, without running the statement after the AND? In other words, I don't understand the difference between this and making the new bits the first statements in the loop body (that is, not part of the for condition).

??

matthewfala · 2023-03-31

Here's the order in which the parts of a for loop run:

Init → Condition → Code → Update → Condition

If you take an event out in the update and the condition then aborts, you risk losing one event:

* Update: take the event out of the bucket queue
* Condition: potentially abort, which discards the event

With this code change, we don't remove the next event from the bucket queue if max_iter is reached.

PettitWesley · 2023-03-31

Ok I think I might get it now...

https://stackoverflow.com/questions/33457399/when-does-the-loop-variable-in-a-for-loop-get-updated

matthewfala · 2023-03-29T23:36:34Z

@leonardo-albertovich, a while back I looked into this deeply, and I believe that the tests are sufficient to reproduce the bug and show that the solution works.

I'm confident in the presented solution. However in the future we can refactor with the suggested change to make it slightly more readable.

PettitWesley · 2023-03-30T00:45:15Z

@matthewfala

> Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS credentials provider refresh function, if the control is not returned to the function that sets a lock before a network call.

The issue description kind of sounds like the mk_event_inject bug that we found, right? #6822

leonardo-albertovich · 2023-03-30T06:44:16Z

I'm out of the loop so I trust your judgment @matthewfala.

As for the bug mentioned by @PettitWesley, I don't think it's related. Of course, the priority event loop is the root of both, but that's as far as it goes from what I see. I guess there's a lesson to be learned in it.

matthewfala · 2023-03-31T19:00:42Z

@PettitWesley I believe these bugs are not related. I think this code is ready to be merged.

github-actions · 2023-06-30T02:06:14Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions · 2024-11-26T02:08:35Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions bot added the docs-required label Jun 27, 2022

matthewfala reviewed Jun 27, 2022

zhenyami force-pushed the event-loop-priority-macro-fix-order branch from 9968c5b to cb81209 Compare July 13, 2022 21:22

zhenyami force-pushed the event-loop-priority-macro-fix-order branch 5 times, most recently from d70b543 to e0f800b Compare July 14, 2022 06:02

zhenyami marked this pull request as ready for review July 15, 2022 19:21

zhenyami requested review from edsiper, leonardo-albertovich, fujimotos and koleini as code owners July 15, 2022 19:21

zhenyami force-pushed the event-loop-priority-macro-fix-order branch 2 times, most recently from 7339ba7 to d5cf8e9 Compare July 15, 2022 23:29

zhenyami marked this pull request as draft July 15, 2022 23:29

zhenyami marked this pull request as ready for review July 18, 2022 15:53

zhenyami force-pushed the event-loop-priority-macro-fix-order branch from d5cf8e9 to 280aeca Compare July 25, 2022 20:14

lecaros mentioned this pull request Oct 6, 2022

aws imds: coroutine not resumed on connection timeout, locking output #5553

Closed

github-actions bot added the Stale label Dec 9, 2022

github-actions bot removed the Stale label Dec 17, 2022

github-actions bot added the Stale label Mar 17, 2023

github-actions bot removed the Stale label Mar 18, 2023

matthewfala reviewed Mar 29, 2023

github-actions bot added the Stale label Jun 30, 2023

matthewfala mentioned this pull request Jul 14, 2023

Injected Events Processed After Cleanup Code #7704

Closed

github-actions bot removed the Stale label Aug 15, 2024

github-actions bot added the Stale label Nov 26, 2024
