Segmentation fault in Envoy 1.32.1 #37769

Open

dmitriyblok opened this issue Dec 19, 2024 · 23 comments
@dmitriyblok

Title: Segmentation fault in Envoy 1.32.1

Description:
Envoy should not crash when one of the Redis cluster replicas cannot be resolved via DNS. This issue has been triaged by Envoy Security.

Repro steps:
I can consistently reproduce this with a cluster configured for static service discovery in which one of the Redis replicas has a hostname that cannot be resolved via DNS.

Config:
connect_timeout: 0.25s
dns_lookup_family: V4_ONLY
load_assignment:
  cluster_name: {{ instance.name }}
  endpoints:
{% for endpoint in instance.endpoints %}
  - lb_endpoints:
{% for port in endpoint.ports %}
    - endpoint:
        address:
          socket_address:
            address: {{ endpoint.address }}
            port_value: {{ port }}
{% endfor %}
{% endfor %}
cluster_type:
  name: envoy.clusters.redis
  typed_config:
    "@type": type.googleapis.com/google.protobuf.Struct
    value:
      cluster_refresh_rate: 5s
      cluster_refresh_timeout: 1s
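
Rendered with hypothetical values (the cluster name, hostname, and port below are placeholders, not from the original report), the template produces a cluster of this shape; the crash reproduces when one of the rendered addresses fails to resolve:

connect_timeout: 0.25s
dns_lookup_family: V4_ONLY
load_assignment:
  cluster_name: redis_cluster                       # hypothetical rendered value
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: redis-replica-0.example.internal   # hypothetical; assume this name does not resolve
            port_value: 6379
cluster_type:
  name: envoy.clusters.redis
  typed_config:
    "@type": type.googleapis.com/google.protobuf.Struct
    value:
      cluster_refresh_rate: 5s
      cluster_refresh_timeout: 1s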


Call Stack:
thread #1, name = 'envoy', stop reason = signal SIGSEGV
  * frame #0: 0x000025773f135800
    frame #1: 0x0000564c0c56e67b envoy`Envoy::Extensions::Clusters::Redis::RedisCluster::onClusterSlotUpdate(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >&&) + 1435
    frame #2: 0x0000564c0c57c5ee envoy`std::__1::__function::__func<Envoy::Extensions::Clusters::Redis::RedisCluster::RedisDiscoverySession::resolveReplicas(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >, unsigned long, std::__1::shared_ptr)::$_11, std::__1::allocator<Envoy::Extensions::Clusters::Redis::RedisCluster::RedisDiscoverySession::resolveReplicas(std::__1::shared_ptr<std::__1::vector<Envoy::Extensions::Clusters::Redis::ClusterSlot, std::__1::allocator<Envoy::Extensions::Clusters::Redis::ClusterSlot> > >, unsigned long, std::__1::shared_ptr)::$_11>, void (Envoy::Network::DnsResolver::ResolutionStatus, std::__1::basic_string_view<char, std::__1::char_traits<char> >, std::__1::list<Envoy::Network::DnsResponse, std::__1::allocator<Envoy::Network::DnsResponse> >&&)>::operator()(Envoy::Network::DnsResolver::ResolutionStatus&&, std::__1::basic_string_view<char, std::__1::char_traits<char> >&&, std::__1::list<Envoy::Network::DnsResponse, std::__1::allocator<Envoy::Network::DnsResponse> >&&) + 958
    frame #3: 0x0000564c0dd9af90 envoy`Envoy::Network::DnsResolverImpl::PendingResolution::finishResolve() + 2544
    frame #4: 0x0000564c0dd99cfa envoy`Envoy::Network::DnsResolverImpl::AddrInfoPendingResolution::onAresGetAddrInfoCallback(int, int, ares_addrinfo*) + 5370
    frame #5: 0x0000564c0dda6b3c envoy`end_hquery + 140
    frame #6: 0x0000564c0ddaeb73 envoy`qcallback + 19
    frame #7: 0x0000564c0dda52e1 envoy`ares_destroy + 97
    frame #8: 0x0000564c0dd979a2 envoy`Envoy::Network::DnsResolverImpl::~DnsResolverImpl() + 50
    frame #9: 0x0000564c0df95da5 envoy`Envoy::Server::InstanceBase::~InstanceBase() + 1669
    frame #10: 0x0000564c0df8e725 envoy`Envoy::Server::InstanceImpl::~InstanceImpl() + 101
    frame #11: 0x0000564c0df482c0 envoy`Envoy::StrippedMainBase::~StrippedMainBase() + 80
    frame #12: 0x0000564c0df49874 envoy`std::__1::default_delete<Envoy::MainCommon>::operator()(Envoy::MainCommon*) const + 84
    frame #13: 0x0000564c0df48c2d envoy`Envoy::MainCommon::main(int, char**, std::__1::function<void (Envoy::Server::Instance&)>) + 157
    frame #14: 0x0000564c0c4ec14c envoy`main + 44
    frame #15: 0x00007f72e9242d90 libc.so.6`___lldb_unnamed_symbol118$$libc.so.6 + 2192
    frame #16: 0x0000564c0c4ec000 envoy
thread #2, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #3, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #4, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
thread #5, stop reason = signal 0
    frame #0: 0x00007f72e933788d libc.so.6`__libc_ifunc_impl_list + 5677
@dmitriyblok added the bug and triage (Issue requires triage) labels on Dec 19, 2024
@adisuissa added the area/redis and area/dns labels and removed the triage (Issue requires triage) label on Dec 20, 2024
@adisuissa
Contributor

cc @msukalski @HenryYYang @mattklein123 @weisisea as codeowners


This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Jan 19, 2025
@dmitriyblok
Author

Is there an update on the status or an estimate on when it will be looked at?

@github-actions bot removed the stale (stalebot believes this issue/PR has not been touched recently) label on Jan 19, 2025
@ravenblackx
Contributor

ravenblackx commented Jan 28, 2025

@dmitriyblok Any chance you can reproduce this with an asan build? The crash log would be much more enlightening that way, and it seems likely you could repro since your crash is more reproducible than ours. (We have a similar crash [details in #38143] but when we run with an asan build it no longer crashes.)

@dmitriyblok
Author

TBH it would take me quite some time and effort to deploy an ASAN build. I can confirm it is still happening in Envoy 1.32.3. Have you tried to emulate a DNS failure by adding/removing clusters? Just a thought.

@tabacco

tabacco commented Jan 30, 2025

I've been adding some notes in #38143 but I'm pretty sure at this point that what I'm seeing is the issue reported in this ticket. I believe I've narrowed it down to being introduced in 1.31.0, as we're not seeing the segfault in 1.30.9 but can reproduce it in 1.31.0.

While I wait for a debug build of Envoy, I've been doing a very unscientific reading of the 1.31.0 changelog, and these items really jump out given that there seem to be two different DNS-related issues:

cares: Upgraded c-ares library to 1.20.1 and added fix to c-ares DNS implementation to additionally check for ARES_EREFUSED, ARES_ESERVFAIL and ARES_ENOTIMP status. Without this fix, DestroyChannelOnRefused and CustomResolverValidAfterChannelDestruction unit test will break.

dns: Changes the behavior of the getaddrinfo DNS resolver so that it treats EAI_NODATA and EAI_NONAME as successful queries with empty results, instead of as DNS failures. This change brings the getaddrinfo behavior in-line with the c-ares resolver behavior. This behavior can be reverted by setting the runtime guard envoy.reloadable_features.dns_nodata_noname_is_success to false.

Out of curiosity, have you already experimented with setting envoy.reloadable_features.dns_nodata_noname_is_success to false? I notice #38256 just got merged a couple of hours ago to disable that runtime guard again.

(Edit: Ignore the struck bits, that was referring to the wrong resolver)
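
For anyone who wants to try that experiment: runtime guards like this can be flipped from the bootstrap via a static runtime layer. A minimal sketch, assuming the standard layered_runtime bootstrap field; note that, per the next comment, this particular guard only affects the getaddrinfo resolver, so it may not apply to this crash:

layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      # runtime guard discussed above; false reverts to the pre-1.31 behavior
      envoy.reloadable_features.dns_nodata_noname_is_success: false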

@fredyw
Member

fredyw commented Jan 31, 2025

> Out of curiosity, have you already experimented with setting envoy.reloadable_features.dns_nodata_noname_is_success to false? I notice #38256 just got merged a couple of hours ago to disable that runtime guard again.

That change is for the getaddrinfo resolver. I believe the issue in #38143 is for the c-ares resolver.

@tabacco

tabacco commented Jan 31, 2025

Yep, you're right! I've been staring at this for too long apparently.

@ravenblackx
Contributor

Even though the ASAN output was garbled, I think it's interesting in that I was expecting to see something about accessing something that's been deleted, but instead it was something about accessing an out of range memory address. I think that likely means a reference was taken to a pointer on the stack that has gone out of scope and been overwritten, rather than something that's been destructor-ed as I was previously expecting.

@ravenblackx
Contributor

Trying to bring the redis-related discussion to this thread, since the other thread started with a crash that isn't redis-related; @fredyw's suggestion in the other thread was apparently not the cause of our redis-cluster crashes. I patched in a check for the response being empty before that response.front(), and the crash still occurs.

@tabacco

tabacco commented Feb 4, 2025

Did you make the same change in resolveReplicas() as well?

*response.front().addrInfo().address_, replica.second));

I tried to do the same fix but I've been butting heads with the build tools all weekend trying to run tests because this is my first time diving into this project at all.

@ravenblackx
Contributor

Haha, just got done going through debugging with a coredump and found it was indeed crashing in that resolveReplicas path. So now I've patched both the ~identical conditions, and trying again.

Patch is

diff --git a/source/extensions/clusters/redis/redis_cluster.cc b/source/extensions/clusters/redis/redis_cluster.cc
index 5fbe0eadb2..6223b03d9a 100644
--- a/source/extensions/clusters/redis/redis_cluster.cc
+++ b/source/extensions/clusters/redis/redis_cluster.cc
@@ -381,7 +381,7 @@ void RedisCluster::RedisDiscoverySession::resolveClusterHostnames(
             updateDnsStats(status, response.empty());
             // If DNS resolution for a primary fails, we stop resolution for remaining, and reset
             // the timer.
-            if (status != Network::DnsResolver::ResolutionStatus::Completed) {
+            if (status != Network::DnsResolver::ResolutionStatus::Completed || response.empty()) {
               ENVOY_LOG(error, "Unable to resolve cluster slot primary hostname {}",
                         slot.primary_hostname_);
               resolve_timer_->enableTimer(parent_.cluster_refresh_rate_);
@@ -435,7 +435,7 @@ void RedisCluster::RedisDiscoverySession::resolveReplicas(
           updateDnsStats(status, response.empty());
           // If DNS resolution fails here, we move on to resolve other replicas in the list.
           // We log a warn message.
-          if (status != Network::DnsResolver::ResolutionStatus::Completed) {
+          if (status != Network::DnsResolver::ResolutionStatus::Completed || response.empty()) {
             ENVOY_LOG(warn, "Unable to resolve cluster replica address {}", replica.first);
           } else {
             // Replica resolved

@ravenblackx
Contributor

(I checked around a bit first to see whether the status itself was actually the error, and

  } else if (isResponseWithNoRecords(status)) {
    // Treat `ARES_ENODATA` or `ARES_ENOTFOUND` here as success to populate back the
    // "empty records" response.
    pending_response_.status_ = ResolutionStatus::Completed;
    pending_response_.details_ = absl::StrCat("cares_norecords:", ares_strerror(status));
    ASSERT(addrinfo == nullptr);
  } else {
    pending_response_.details_ = absl::StrCat("cares_failure:", ares_strerror(status));
  }
suggests to me that it's intentional that Completed+empty is a valid state - also the updateDnsStats arguments imply the same.)

@ravenblackx
Contributor

ravenblackx commented Feb 4, 2025

(Though I think it's likely that a timeout, which is what's happening in our case, should be a different status nonetheless.)

#11 Envoy::Network::DnsResolverImpl::AddrInfoPendingResolution::onAresGetAddrInfoCallback (this=0x36cabb96c000, status=<optimized out>, timeouts=<optimized out>, addrinfo=0x0)
#12 end_hquery (hquery=0x36cabb3a7580, status=ARES_ETIMEOUT)

@tabacco

tabacco commented Feb 4, 2025

Yeah a timeout should definitely be treated more like a SERVFAIL or similar, since it's a query that would probably normally return records, but isn't currently working.

Completed+empty does seem legit in the case of, for example, a fully scaled-down cluster.

@tabacco

tabacco commented Feb 4, 2025

What's annoying is that c-ares doesn't return a specific status code for a failure due to timeouts, simply a count of timeouts and a response that's not ARES_SUCCESS (based on the log behavior). I suppose it's possible to compare against the configured maximum number of timeouts from https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/network/dns_resolver/cares/v3/cares_dns_resolver.proto#envoy-v3-api-msg-extensions-network-dns-resolver-cares-v3-caresdnsresolverconfig to determine whether timeouts were the cause of the non-success.
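
For context, the c-ares resolver settings referenced above are attached per cluster (or in the bootstrap) through typed_dns_resolver_config. A minimal sketch showing only widely available fields; the resolver address is a placeholder, and the timeout/retry tuning discussed here is omitted because its availability varies by Envoy version:

typed_dns_resolver_config:
  name: envoy.network.dns_resolver.cares
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
    resolvers:
    - socket_address:
        address: 127.0.0.1        # placeholder upstream DNS server
        port_value: 53
    dns_resolver_options:
      use_tcp_for_dns_lookups: false
      no_default_search_domain: true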

@ravenblackx
Contributor

> What's annoying is that c-ares doesn't return a specific status code for a failure due to timeouts, simply a count of timeouts

end_hquery (hquery=0x36cabb3a7580, status=ARES_ETIMEOUT) looks like there's a specific status there? Looking at the source of end_hquery, it looks like the status gets passed unaltered to the callback too, so I would expect onAresGetAddrInfoCallback to also be receiving ARES_ETIMEOUT as the status.

@tabacco

tabacco commented Feb 4, 2025

Oh you're right, because we're hitting this:

"dns resolution for {} failed with c-ares status {}", dns_name_, status);

nothing actually checks for that specifically, though.

@ravenblackx
Contributor

Yeah, it's envoy being annoying not c-ares. :D

@tabacco

tabacco commented Feb 4, 2025

Sorry I doubted you, c-ares.

@tabacco

tabacco commented Feb 7, 2025

So I believe 1.33.0 should "fix" this. It doesn't actually fix the Redis cluster issue, but it does put the DNS timeouts back where they were before c-ares reduced them in 1.20 (which was pulled into Envoy in 1.31.0).

I'm going to let it run over the weekend and report back.

The redis cluster issue is still a bug that can/will happen though.

@dmitriyblok
Author

I can still see the crash in 1.33.0.

@tabacco

tabacco commented Feb 8, 2025

Oh well... @ravenblackx Since you have a patch going, are you planning to work on a PR or should I try to get one started?
