Segmentation fault in Envoy 1.32.1 #37769

Open

dmitriyblok opened this issue Dec 19, 2024 · 23 comments
@dmitriyblok

Title: Segmentation fault in Envoy 1.32.1

Description:
Envoy should not crash when one of the Redis cluster replicas cannot be resolved via DNS. This issue has been triaged by Envoy Security.

Repro steps:
I can consistently reproduce this with a cluster configured for static service discovery in which one of the Redis replicas has a hostname that cannot be resolved via DNS.

Config:
connect_timeout: 0.25s
dns_lookup_family: V4_ONLY
load_assignment:
  cluster_name: {{ instance.name }}
  endpoints:
{% for endpoint in instance.endpoints %}
  - lb_endpoints:
{% for port in endpoint.ports %}
    - endpoint:
        address:
          socket_address:
            address: {{ endpoint.address }}
            port_value: {{ port }}
{% endfor %}
{% endfor %}
cluster_type:
  name: envoy.clusters.redis
  typed_config:
    "@type": type.googleapis.com/google.protobuf.Struct
    value:
      cluster_refresh_rate: 5s
      cluster_refresh_timeout: 1s
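
Rendered with hypothetical values (the cluster name, hostname, and port below are placeholders, not from the original report), the template produces a cluster of this shape; the crash reproduces when one of the rendered addresses fails to resolve:

connect_timeout: 0.25s
dns_lookup_family: V4_ONLY
load_assignment:
  cluster_name: redis_cluster                       # hypothetical rendered value
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: redis-replica-0.example.internal   # hypothetical; assume this name does not resolve
            port_value: 6379
cluster_type:
  name: envoy.clusters.redis
  typed_config:
    "@type": type.googleapis.com/google.protobuf.Struct
    value:
      cluster_refresh_rate: 5s
      cluster_refresh_timeout: 1s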


Call Stack:
thread #1, name = 'envoy', stop reason = signal SIGSEGV
  * frame #0: 0x000025773f135800
    frame #1: 0x0000564c0c56e67b envoy`Envoy::Extensions::Clusters::Redis::RedisCluster::onClusterSlotUpdate(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >&&) + 1435
    frame #2: 0x0000564c0c57c5ee envoy`std::__1::__function::__func<Envoy::Extensions::Clusters::Redis::RedisCluster::RedisDiscoverySession::resolveReplicas(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >, unsigned long, std::__1::shared_ptr)::$_11, std::__1::allocator<Envoy::Extensions::Clusters::Redis::RedisCluster::RedisDiscoverySession::resolveReplicas(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >, unsigned long, std::__1::shared_ptr)::$_11>, void (Envoy::Network::DnsResolver::ResolutionStatus, std::__1::basic_string_view<char, std::__1::char_traits<char> >, std::__1::list<Envoy::Network::DnsResponse, std::__1::allocator<Envoy::Network::DnsResponse> >&&)>::operator()(Envoy::Network::DnsResolver::ResolutionStatus&&, std::__1::basic_string_view<char, std::__1::char_traits<char> >&&, std::__1::list<Envoy::Network::DnsResponse, std::__1::allocator<Envoy::Network::DnsResponse> >&&) + 958
    frame #3: 0x0000564c0dd9af90 envoy`Envoy::Network::DnsResolverImpl::PendingResolution::finishResolve() + 2544
    frame #4: 0x0000564c0dd99cfa envoy`Envoy::Network::DnsResolverImpl::AddrInfoPendingResolution::onAresGetAddrInfoCallback(int, int, ares_addrinfo*) + 5370
    frame #5: 0x0000564c0dda6b3c envoy`end_hquery + 140
    frame #6: 0x0000564c0ddaeb73 envoy`qcallback + 19
    frame #7: 0x0000564c0dda52e1 envoy`ares_destroy + 97
    frame #8: 0x0000564c0dd979a2 envoy`Envoy::Network::DnsResolverImpl::~DnsResolverImpl() + 50
    frame #9: 0x0000564c0df95da5 envoy`Envoy::Server::InstanceBase::~InstanceBase() + 1669
    frame #10: 0x0000564c0df8e725 envoy`Envoy::Server::InstanceImpl::~InstanceImpl() + 101
    frame #11: 0x0000564c0df482c0 envoy`Envoy::StrippedMainBase::~StrippedMainBase() + 80
    frame #12: 0x0000564c0df49874 envoy`std::__1::default_delete<Envoy::MainCommon>::operator()(Envoy::MainCommon*) const + 84
    frame #13: 0x0000564c0df48c2d envoy`Envoy::MainCommon::main(int, char**, std::__1::function<void (Envoy::Server::Instance&)>) + 157
    frame #14: 0x0000564c0c4ec14c envoy`main + 44
    frame #15: 0x00007f72e9242d90 libc.so.6`___lldb_unnamed_symbol118$$libc.so.6 + 2192
    frame #16: 0x0000564c0c4ec000 envoy
thread #2, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #3, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #4, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #5, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
@dmitriyblok added the bug and triage (Issue requires triage) labels on Dec 19, 2024
@adisuissa added the area/redis and area/dns labels and removed the triage (Issue requires triage) label on Dec 20, 2024
@adisuissa
Contributor

cc @msukalski @HenryYYang @mattklein123 @weisisea as codeowners


This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Jan 19, 2025
@dmitriyblok
Author

Is there an update on the status or an estimate on when it will be looked at?

@github-actions bot removed the stale (stalebot believes this issue/PR has not been touched recently) label on Jan 19, 2025
@ravenblackx
Contributor

ravenblackx commented Jan 28, 2025

@dmitriyblok Any chance you can reproduce this with an asan build? The crash log would be much more enlightening that way, and it seems likely you could repro since your crash is more reproducible than ours. (We have a similar crash [details in #38143] but when we run with an asan build it no longer crashes.)

@dmitriyblok
Author

TBH it would take me quite some time and effort to deploy an ASAN build. I can confirm it is still happening in Envoy 1.32.3. Have you tried to emulate a DNS failure by adding/removing clusters? Just a thought.

@tabacco

tabacco commented Jan 30, 2025

I've been adding some notes in #38143 but I'm pretty sure at this point that what I'm seeing is the issue reported in this ticket. I believe I've narrowed it down to being introduced in 1.31.0, as we're not seeing the segfault in 1.30.9 but can reproduce it in 1.31.0.

While I wait for a debug build of Envoy, I've been doing a very unscientific reading of the 1.31.0 changelog, and these items really jump out given that there seem to be two different DNS-related issues:

cares: Upgraded c-ares library to 1.20.1 and added fix to c-ares DNS implementation to additionally check for ARES_EREFUSED, ARES_ESERVFAIL and ARES_ENOTIMP status. Without this fix, DestroyChannelOnRefused and CustomResolverValidAfterChannelDestruction unit test will break.

dns: Changes the behavior of the getaddrinfo DNS resolver so that it treats EAI_NODATA and EAI_NONAME as successful queries with empty results, instead of as DNS failures. This change brings the getaddrinfo behavior in-line with the c-ares resolver behavior. This behavior can be reverted by setting the runtime guard envoy.reloadable_features.dns_nodata_noname_is_success to false.

Out of curiosity, have you already experimented with setting envoy.reloadable_features.dns_nodata_noname_is_success to false? I notice #38256 just got merged a couple of hours ago to disable that runtime guard again.

(Edit: Ignore the struck bits, that was referring to the wrong resolver)
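
For anyone who wants to try that experiment: runtime guards like this can be flipped from the bootstrap via a static runtime layer. A minimal sketch, assuming the standard layered_runtime bootstrap field; note that, per the next comment, this particular guard only affects the getaddrinfo resolver, so it may not apply to this crash:

layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      # runtime guard discussed above; false reverts to the pre-1.31 behavior
      envoy.reloadable_features.dns_nodata_noname_is_success: false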

@fredyw
Member

fredyw commented Jan 31, 2025

> Out of curiosity, have you already experimented with setting envoy.reloadable_features.dns_nodata_noname_is_success to false? I notice #38256 just got merged a couple of hours ago to disable that runtime guard again.

That change is for the getaddrinfo resolver. I believe the issue in #38143 is for the c-ares resolver.

@tabacco

tabacco commented Jan 31, 2025

Yep, you're right! I've been staring at this for too long apparently.

@ravenblackx
Contributor

Even though the ASAN output was garbled, I think it's interesting in that I was expecting to see something about accessing something that's been deleted, but instead it was something about accessing an out of range memory address. I think that likely means a reference was taken to a pointer on the stack that has gone out of scope and been overwritten, rather than something that's been destructor-ed as I was previously expecting.

@ravenblackx
Contributor

Trying to bring the redis-related discussion to this thread, since the other thread started with a crash that isn't redis-related; @fredyw's suggestion in the other thread was apparently not the cause of our redis-cluster crashes. I patched in a check for the response being empty before that response.front(), and the crash still occurs.

@tabacco

tabacco commented Feb 4, 2025

Did you make the same change in resolveReplicas() as well?

*response.front().addrInfo().address_, replica.second));

I tried to do the same fix but I've been butting heads with the build tools all weekend trying to run tests because this is my first time diving into this project at all.

@ravenblackx
Contributor

Haha, just got done going through debugging with a coredump and found it was indeed crashing in that resolveReplicas path. So now I've patched both the ~identical conditions, and trying again.

Patch is

diff --git a/source/extensions/clusters/redis/redis_cluster.cc b/source/extensions/clusters/redis/redis_cluster.cc
index 5fbe0eadb2..6223b03d9a 100644
--- a/source/extensions/clusters/redis/redis_cluster.cc
+++ b/source/extensions/clusters/redis/redis_cluster.cc
@@ -381,7 +381,7 @@ void RedisCluster::RedisDiscoverySession::resolveClusterHostnames(
             updateDnsStats(status, response.empty());
             // If DNS resolution for a primary fails, we stop resolution for remaining, and reset
             // the timer.
-            if (status != Network::DnsResolver::ResolutionStatus::Completed) {
+            if (status != Network::DnsResolver::ResolutionStatus::Completed || response.empty()) {
               ENVOY_LOG(error, "Unable to resolve cluster slot primary hostname {}",
                         slot.primary_hostname_);
               resolve_timer_->enableTimer(parent_.cluster_refresh_rate_);
@@ -435,7 +435,7 @@ void RedisCluster::RedisDiscoverySession::resolveReplicas(
           updateDnsStats(status, response.empty());
           // If DNS resolution fails here, we move on to resolve other replicas in the list.
           // We log a warn message.
-          if (status != Network::DnsResolver::ResolutionStatus::Completed) {
+          if (status != Network::DnsResolver::ResolutionStatus::Completed || response.empty()) {
             ENVOY_LOG(warn, "Unable to resolve cluster replica address {}", replica.first);
           } else {
             // Replica resolved

@ravenblackx
Contributor

(I checked around a bit first to see whether the status itself was actually the error, and

  } else if (isResponseWithNoRecords(status)) {
    // Treat `ARES_ENODATA` or `ARES_ENOTFOUND` here as success to populate back the
    // "empty records" response.
    pending_response_.status_ = ResolutionStatus::Completed;
    pending_response_.details_ = absl::StrCat("cares_norecords:", ares_strerror(status));
    ASSERT(addrinfo == nullptr);
  } else {
    pending_response_.details_ = absl::StrCat("cares_failure:", ares_strerror(status));
  }
suggests to me that it's intentional that Completed+empty is a valid state - also the updateDnsStats arguments imply the same.)

@ravenblackx
Contributor

ravenblackx commented Feb 4, 2025

(Though I think it's likely that a timeout, which is what's happening in our case, should be a different status nonetheless.)

#11 Envoy::Network::DnsResolverImpl::AddrInfoPendingResolution::onAresGetAddrInfoCallback (this=0x36cabb96c000, status=<optimized out>, timeouts=<optimized out>, addrinfo=0x0)
#12 end_hquery (hquery=0x36cabb3a7580, status=ARES_ETIMEOUT)

@tabacco

tabacco commented Feb 4, 2025

Yeah a timeout should definitely be treated more like a SERVFAIL or similar, since it's a query that would probably normally return records, but isn't currently working.

Completed+empty does seem legit in the case of, for example, a fully scaled-down cluster.

@tabacco

tabacco commented Feb 4, 2025

What's annoying is that c-ares doesn't return a specific status code for a failure due to timeouts, simply a count of timeouts and a response that's not ARES_SUCCESS (based on the log behavior). I suppose it's possible to compare against the configured maximum number of timeouts from https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/network/dns_resolver/cares/v3/cares_dns_resolver.proto#envoy-v3-api-msg-extensions-network-dns-resolver-cares-v3-caresdnsresolverconfig to determine whether timeouts were the cause of the non-success.
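
For context, the c-ares resolver settings referenced above are attached per cluster (or in the bootstrap) through typed_dns_resolver_config. A minimal sketch showing only widely available fields; the resolver address is a placeholder, and the timeout/retry tuning discussed here is omitted because its availability varies by Envoy version:

typed_dns_resolver_config:
  name: envoy.network.dns_resolver.cares
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
    resolvers:
    - socket_address:
        address: 127.0.0.1        # placeholder upstream DNS server
        port_value: 53
    dns_resolver_options:
      use_tcp_for_dns_lookups: false
      no_default_search_domain: true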

@ravenblackx
Contributor

> What's annoying is that c-ares doesn't return a specific status code for a failure due to timeouts, simply a count of timeouts

end_hquery (hquery=0x36cabb3a7580, status=ARES_ETIMEOUT) looks like there's a specific status there? Looking at the source of end_hquery, it looks like the status gets passed unaltered to the callback too, so I would expect onAresGetAddrInfoCallback to also be receiving ARES_ETIMEOUT as the status.

@tabacco

tabacco commented Feb 4, 2025

Oh you're right, because we're hitting this:

"dns resolution for {} failed with c-ares status {}", dns_name_, status);

nothing actually checks for that specifically, though.

@ravenblackx
Contributor

Yeah, it's envoy being annoying not c-ares. :D

@tabacco

tabacco commented Feb 4, 2025

Sorry I doubted you, c-ares.

@tabacco

tabacco commented Feb 7, 2025

So I believe 1.33.0 should "fix" this. It doesn't actually fix the Redis cluster issue, but it does put the DNS timeouts back where they were before c-ares reduced them in 1.20 (which was pulled into Envoy in 1.31.0).

I'm going to let it run over the weekend and report back.

The redis cluster issue is still a bug that can/will happen though.

@dmitriyblok
Author

I can still see the crash in 1.33.0.

@tabacco

tabacco commented Feb 8, 2025

Oh well... @ravenblackx Since you have a patch going, are you planning to work on a PR or should I try to get one started?
