dns resolution for plugins using async mode does not consider all the DNS servers on the host #5862
Comments
Based on my debugging, I concluded that during the failure mode i.e. using async mode, we are following the codepath here. This calls In the happy path when using plugins with async mode disabled, the codepath followed is this one. This calls I also tried to compile the plugins by setting async mode disabled using
Yes. Due to some constraints in the original design, the async DNS client aborts after the first resolution error. A refactor is on the way, but we don't have an ETA yet. If you are interested in contributing, let me know; I wrote the code, so I can probably help if you have any questions.
@leonardo-albertovich How come the setting I see here doesn't fix it: https://github.com/fluent/fluent-bit/blob/1.9/src/flb_upstream.c#L43 ? Does that not do anything?
That setting selects between c-ares, which is asynchronous, and the default system resolver, which is synchronous. If you can accept those DNS queries blocking, it should be fine. If you need to minimize the overhead, you can use a non-authoritative local caching DNS server, as most modern distributions do (in the end this would behave the same for both async and sync, since the real query wouldn't be performed by fluent-bit).
So we determined that setting this is a valid workaround for the problem:
For Windows container users, all outputs will need this option set. I am wondering if we could consider contributing a new environment variable that is like a global setting for We could hide this new setting or env var behind a new CMake flag which would default to
Why is this flagged as Windows? I had the same problem on Flatcar, so I'm assuming this applies to any OS, right?
Bug Report
Describe the bug
In fluent-bit, plugins can use async mode for better performance. This is the default setting, so many plugins use it.
Consider a scenario in which the host has multiple DNS servers [x.x.x.x, y.y.y.y] such that the first (primary) DNS server (x.x.x.x) does not resolve the endpoint but the secondary DNS server (y.y.y.y) resolves it correctly. In such cases, plugins using async mode attempt DNS resolution with the primary DNS server only and fail without ever trying the secondary DNS server. The error for the same is-
This is in contrast to how DNS resolution should happen. The expected behaviour is for resolution to be tried against every server in the DNS server list before conceding an error.
Note: This scenario works perfectly for plugins wherein async mode is disabled.
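As an illustration of the expected behaviour described above, here is a minimal Python sketch, not fluent-bit code. It tries each configured server in order and only reports an error after every server has failed. The names `resolve_with_fallback`, `ResolutionError`, and `fake_query` are hypothetical, and `fake_query` is a stand-in for a real DNS query so that the example runs without network access.

```python
from typing import Callable, Optional, Sequence


class ResolutionError(Exception):
    """Raised when none of the configured DNS servers resolves the name."""


def resolve_with_fallback(
    hostname: str,
    servers: Sequence[str],
    query: Callable[[str, str], Optional[str]],
) -> str:
    """Try each server in order and return the first address found."""
    failures = []
    for server in servers:
        try:
            address = query(server, hostname)
        except OSError as exc:
            # Record the failure and move on instead of aborting here.
            failures.append(f"{server}: {exc}")
            continue
        if address is not None:
            return address
        failures.append(f"{server}: no answer")
    # Only concede an error once every server has been tried.
    raise ResolutionError(
        f"could not resolve {hostname}: " + "; ".join(failures)
    )


def fake_query(server: str, hostname: str) -> Optional[str]:
    """Hypothetical stand-in for a DNS query (assumption, no real lookup)."""
    if server == "x.x.x.x":
        raise OSError("timeout")  # the primary server fails
    return "203.0.113.10"  # the secondary answers (documentation address)


print(resolve_with_fallback("example.com", ["x.x.x.x", "y.y.y.y"], fake_query))
```

The behaviour reported in this issue corresponds to raising on the first `OSError` rather than continuing the loop.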
To Reproduce
The issue can be reproduced with the following steps:
Create a new Linux VM, or create a new container using cr.fluentbit.io/fluent/fluent-bit:1.9.6-debug
Inside the container or VM, change the DNS settings to include an invalid DNS server as the primary DNS server.
Start fluent-bit with the http output plugin to send logs to a remote server.
Using http://www.google.com:443 is the easiest repro of the issue, as fluent-bit fails at DNS resolution before anything else happens. The same issue applies when using the Kinesis Streams and Kinesis Firehose plugins. We could not use an HTTP benchmarking server on localhost because the repro requires DNS resolution. Such a server can be set up on another machine and used here to replicate the issue.
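To double-check the resolver order after the second step, a small Python sketch like the one below can list the `nameserver` entries in the order they appear. It assumes the standard `resolv.conf` format; the function name `parse_nameservers` and the sample addresses (taken from documentation ranges) are illustrative and not part of the original report.

```python
from typing import List

# Sample content mimicking the repro: an invalid primary server first,
# followed by a working secondary. Both addresses are placeholders.
SAMPLE_RESOLV_CONF = """\
# invalid primary DNS server
nameserver 192.0.2.53
nameserver 198.51.100.53
options timeout:2
"""


def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Return nameserver addresses in the order they are listed."""
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (lines starting with '#' or ';').
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


print(parse_nameservers(SAMPLE_RESOLV_CONF))
```

Inside the container, the same function could be applied to the contents of `/etc/resolv.conf` to confirm that the invalid server is listed first.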
Expected behavior
DNS resolution should happen with the secondary DNS Server before erroring out.
For the first example (www.google.com), the logs cannot be sent since it is not a valid destination, so we get a 405 error from the Google server. This means that DNS resolution worked fine.
Screenshots
Your Environment
Additional context