Segmentation fault in Envoy 1.32.1 #37769
Comments
cc @msukalski @HenryYYang @mattklein123 @weisisea as codeowners
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Is there an update on the status or an estimate of when it will be looked at?
@dmitriyblok Any chance you can reproduce this with an asan build? The crash log would be much more enlightening that way, and it seems likely you could repro, since your crash is more reproducible than ours. (We have a similar crash [details in #38143], but when we run with an asan build it no longer crashes.)
TBH it would take me quite some time and a lot of effort to deploy an asan build. I can confirm it is still happening in Envoy 1.32.3. Have you tried to emulate DNS failure by adding/removing clusters? Just a thought.
I've been adding some notes in #38143, but I'm pretty sure at this point that what I'm seeing is the issue reported in this ticket. I believe I've narrowed it down to being introduced in 1.31.0, as we're not seeing the segfault in 1.30.9 but can reproduce it in 1.31.0. While I wait for a debug build of Envoy to happen, I've been doing a very unscientific reading of the 1.31.0 changelog, and these items really jump out given that there seem to be two different DNS-related issues:
(Edit: Ignore the struck bits, that was referring to the wrong resolver)
That change is for the getaddrinfo resolver. I believe the issue in #38143 is for the c-ares resolver.
Yep, you're right! I've been staring at this for too long apparently.
Even though the ASAN output was garbled, I think it's interesting: I was expecting to see something about accessing memory that had been deleted, but instead it complained about accessing an out-of-range memory address. I think that likely means a reference was taken to a pointer on the stack that has since gone out of scope and been overwritten, rather than to something that had been destructed, as I was previously expecting.
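For anyone less familiar with that distinction, here is a minimal stand-alone illustration (plain C++, not Envoy code) of the failure mode being described: a callback keeps a reference to an object that lived on a stack frame which has since been torn down, so by the time the callback runs it reads whatever now occupies that address, which a sanitizer tends to report as a bogus or out-of-range access rather than a heap use-after-free.

#include <functional>
#include <iostream>
#include <vector>

// Returns a callback that (incorrectly) captures a local by reference.
std::function<void()> makeCallback() {
  std::vector<int> resolved = {1, 2, 3};  // lives on makeCallback's stack frame
  return [&resolved]() {                  // bug: reference capture of a local
    // By the time this runs, `resolved` is gone; this reads a dead stack slot.
    std::cout << resolved.size() << "\n";
  };
}

int main() {
  auto cb = makeCallback();  // the frame holding `resolved` is destroyed here
  cb();                      // undefined behavior: garbage read or crash
}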
Trying to bring redis-related things to this thread, since the start of the other thread is a non-redis-related crash; @fredyw's suggestion in the other thread was apparently not the cause of our redis-cluster crashes. I patched in a check for the response being empty before that.
Did you make the same change in resolveReplicas() as well?
I tried to do the same fix but I've been butting heads with the build tools all weekend trying to run tests because this is my first time diving into this project at all. |
Haha, just got done going through debugging with a coredump and found it was indeed crashing in that resolveReplicas path. So now I've patched both of the ~identical conditions and am trying again. The patch is essentially an empty-response guard in both spots.
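Roughly, the shape of that guard is sketched below. This is a hedged illustration, not the literal diff: ResolutionStatus and DnsResponse here are simplified stand-ins for the real Envoy types, and onUnexpectedFailure() is a hypothetical hook standing in for the discovery session's existing failure/retry path.

#include <iostream>
#include <list>
#include <string>

// Simplified stand-ins for the Envoy types involved (illustration only).
enum class ResolutionStatus { Completed, Failure };
struct DnsResponse { std::string address_; };

// Hypothetical hook standing in for the discovery session's failure handling.
void onUnexpectedFailure() { std::cout << "discovery attempt failed\n"; }

// The guard itself: bail out before touching response.front() when the
// resolver reports Completed but hands back zero addresses (the unresolvable
// replica case), instead of indexing into an empty list.
void onDnsResponse(ResolutionStatus status, std::list<DnsResponse>&& response) {
  if (status != ResolutionStatus::Completed || response.empty()) {
    onUnexpectedFailure();
    return;
  }
  std::cout << "resolved to " << response.front().address_ << "\n";
}

int main() {
  onDnsResponse(ResolutionStatus::Completed, {});              // unresolvable replica
  onDnsResponse(ResolutionStatus::Completed, {{"10.0.0.5"}});  // healthy replica
}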
(I checked around a bit first to see whether the status was actually the error; see envoy/source/extensions/network/dns_resolver/cares/dns_impl.cc, lines 283 to 291 at 057f7e0.)
(Though I think it's likely that a timeout, which is what's happening in our case, should be a different status nonetheless.)
Yeah a timeout should definitely be treated more like a SERVFAIL or similar, since it's a query that would probably normally return records, but isn't currently working. Completed+empty does seem legit in the case of, for example, a fully scaled-down cluster.
What's annoying is that c-ares doesn't return a specific status code for a failure due to timeouts; it simply reports a count of timeouts and a response that's not ARES_SUCCESS (based on the log behavior). I suppose it's possible to compare that timeout count against the configured maximum from https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/network/dns_resolver/cares/v3/cares_dns_resolver.proto#envoy-v3-api-msg-extensions-network-dns-resolver-cares-v3-caresdnsresolverconfig to determine whether timeouts were the cause of the non-success.
Oh you're right, because we're hitting this:
nothing actually checks for that specifically, though.
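To make that comparison concrete, here is a hedged sketch of the idea. The callback arguments mirror the real c-ares addrinfo callback (a status, a count of attempts that timed out, and the result), but the Outcome enum, the classify() helper, and the max_tries parameter standing in for the configured value from the resolver config are all hypothetical illustration, not Envoy's actual handling.

#include <iostream>

// Stand-ins so this compiles without the c-ares headers; in real code these
// come from <ares.h>, where ARES_SUCCESS is 0.
constexpr int kAresSuccess = 0;
struct ares_addrinfo;  // opaque here; only the pointer is passed around

// Hypothetical classification of a finished query. c-ares itself reports a
// non-success status plus a count of timed-out attempts, but no dedicated
// "the whole query timed out" status.
enum class Outcome { Resolved, TimedOut, OtherFailure };

// `max_tries` stands in for the configured attempt count from the cares
// resolver config; the other arguments mirror (status, timeouts, result).
Outcome classify(int status, int timeouts, const ares_addrinfo* result,
                 int max_tries) {
  if (status == kAresSuccess && result != nullptr) {
    return Outcome::Resolved;
  }
  if (timeouts >= max_tries) {
    // Every attempt timed out: surface this like a SERVFAIL-ish transient
    // failure rather than "completed with zero records".
    return Outcome::TimedOut;
  }
  return Outcome::OtherFailure;  // NXDOMAIN, refused, etc.
}

int main() {
  // Simulated: 4 configured tries, all 4 timed out, non-success status code.
  const bool timed_out =
      classify(/*status=*/1, /*timeouts=*/4, nullptr, /*max_tries=*/4) ==
      Outcome::TimedOut;
  std::cout << (timed_out ? "treat as timeout/failure\n" : "other outcome\n");
}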
Yeah, it's envoy being annoying, not c-ares. :D
Sorry I doubted you, c-ares.
So I believe 1.33.0 should "fix" this. It doesn't actually fix the redis cluster issue, but it does put the DNS timeouts back where they were before c-ares reduced them in 1.20 (which was pulled into Envoy with 1.31.0). I'm going to let it run over the weekend and report back. The redis cluster issue is still a bug that can/will happen, though.
I can still see the crash in 1.33.0.
Oh well... @ravenblackx, since you have a patch going, are you planning to work on a PR, or should I try to get one started?
Title: Segmentation fault in Envoy 1.32.1
Description:
Envoy should not crash when one of the Redis cluster replicas is unreachable by DNS query. This issue has been triaged by Envoy Security.
Repro steps:
I can consistently reproduce this with a cluster configured for static service discovery where one Redis replica's hostname cannot be resolved by DNS.
Config:
connect_timeout: 0.25s
dns_lookup_family: V4_ONLY
load_assignment:
  cluster_name: {{ instance.name }}
  endpoints:
  {% for endpoint in instance.endpoints %}
    - lb_endpoints:
      {% for port in endpoint.ports %}
        - endpoint:
            address:
              socket_address:
                address: {{ endpoint.address }}
                port_value: {{ port }}
      {% endfor %}
  {% endfor %}
cluster_type:
  name: envoy.clusters.redis
  typed_config:
    "@type": type.googleapis.com/google.protobuf.Struct
    value:
      cluster_refresh_rate: 5s
      cluster_refresh_timeout: 1s
Logs:
Call Stack:
thread #1, name = 'envoy', stop reason = signal SIGSEGV
  frame #1: 0x0000564c0c56e67b envoy`Envoy::Extensions::Clusters::Redis::RedisCluster::onClusterSlotUpdate(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >&&) + 1435
  frame #2: 0x0000564c0c57c5ee envoy`std::__1::__function::__func<Envoy::Extensions::Clusters::Redis::RedisCluster::RedisDiscoverySession::resolveReplicas(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >, unsigned long, std::__1::shared_ptr)::$_11, std::__1::allocator<Envoy::Extensions::Clusters::Redis::RedisCluster::RedisDiscoverySession::resolveReplicas(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >, unsigned long, std::__1::shared_ptr)::$_11>, void (Envoy::Network::DnsResolver::ResolutionStatus, std::__1::basic_string_view<char, std::__1::char_traits<char> >, std::__1::list<Envoy::Network::DnsResponse, std::__1::allocator<Envoy::Network::DnsResponse> >&&)>::operator()(Envoy::Network::DnsResolver::ResolutionStatus&&, std::__1::basic_string_view<char, std::__1::char_traits<char> >&&, std::__1::list<Envoy::Network::DnsResponse, std::__1::allocator<Envoy::Network::DnsResponse> >&&) + 958
  frame #3: 0x0000564c0dd9af90 envoy`Envoy::Network::DnsResolverImpl::PendingResolution::finishResolve() + 2544
  frame #4: 0x0000564c0dd99cfa envoy`Envoy::Network::DnsResolverImpl::AddrInfoPendingResolution::onAresGetAddrInfoCallback(int, int, ares_addrinfo*) + 5370
  frame #5: 0x0000564c0dda6b3c envoy`end_hquery + 140
  frame #6: 0x0000564c0ddaeb73 envoy`qcallback + 19
  frame #7: 0x0000564c0dda52e1 envoy`ares_destroy + 97
  frame #8: 0x0000564c0dd979a2 envoy`Envoy::Network::DnsResolverImpl::~DnsResolverImpl() + 50
  frame #9: 0x0000564c0df95da5 envoy`Envoy::Server::InstanceBase::~InstanceBase() + 1669
  frame #10: 0x0000564c0df8e725 envoy`Envoy::Server::InstanceImpl::~InstanceImpl() + 101
  frame #11: 0x0000564c0df482c0 envoy`Envoy::StrippedMainBase::~StrippedMainBase() + 80
  frame #12: 0x0000564c0df49874 envoy`std::__1::default_delete<Envoy::MainCommon>::operator()(Envoy::MainCommon*) const + 84
  frame #13: 0x0000564c0df48c2d envoy`Envoy::MainCommon::main(int, char**, std::__1::function<void (Envoy::Server::Instance&)>) + 157
  frame #14: 0x0000564c0c4ec14c envoy`main + 44
  frame #15: 0x00007f72e9242d90 libc.so.6`___lldb_unnamed_symbol118$$libc.so.6 + 2192
  frame #16: 0x0000564c0c4ec000 envoy
thread #2, stop reason = signal 0
  frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #3, stop reason = signal 0
  frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #4, stop reason = signal 0
  frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #5, stop reason = signal 0
  frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677