-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DNS resolution fails due to einval error from gen_udp:connect #5092
Comments
This seems to be a Linux bug / pecularity in the implementation of If I first connect to an address on the loopback interface, then to a more public address, it fails with EINVAL from the actual Eshell V12.0.2 (abort with ^G)
1> {ok, S} = gen_udp:open(0, []).
{ok,#Port<0.5>}
2> gen_udp:connect(S, {127,0,0,53}, 53).
ok
3> gen_udp:connect(S, {8,8,8,8}, 53).
{error,einval}
4> gen_udp:close(S).
ok
5> f(S), {ok, S} = gen_udp:open(0, []).
{ok,#Port<0.6>}
6> gen_udp:connect(S, {8,8,8,8}, 53).
ok
7> gen_udp:connect(S, {127,0,0,53}, 53).
ok
8> gen_udp:connect(S, {8,8,8,8}, 53).
ok If I take a step down and try it with the closer to the metal Eshell V12.0.2 (abort with ^G)
1> {ok, S} = socket:open(inet, dgram).
{ok,{'$socket',#Ref<0.326745212.1599209476.194695>}}
2> socket:bind(S, any).
ok
3> socket:connect(S, #{family => inet, addr => {127,0,0,1}, port => 53}).
ok
4> socket:connect(S, #{family => inet, addr => {8,8,8,8}, port => 53}).
{error,einval}
5> socket:close(S).
ok
6> f(S), {ok, S} = socket:open(inet, dgram).
{ok,{'$socket',#Ref<0.326745212.1599209476.194735>}}
7> socket:bind(S, any).
ok
8> socket:connect(S, #{family => inet, addr => {8,8,8,8}, port => 53}).
ok
9> socket:connect(S, #{family => inet, addr => {127,0,0,1}, port => 53}).
ok
10> socket:connect(S, #{family => inet, addr => {8,8,8,8}, port => 53}).
ok It appears that Linux does something with the socket if it first connects to a loopback address, so it later can not connect to an address outside that interface, or? The man page for
But not quite freely, it seems... :-( The man page for
So I added some code to 3b> socket:connect(S, #{family => 0, addr => <<0,0>>}). i.e connect to What do do?
|
Wow, thank you for digging into this. Fwiw, the cost of closing and opening a new socket in my app is trivial since trying the second DNS server shouldn't happen that often. I took a quick look at the Linux source code for handling UDP connect() calls, but I have no insights to report. I'm not sure how to assist further, but please let me know if there's a way. |
I have tried to Google this, but get no good hits for There should be Linux forums to ask in, bug report tools to use. But I am not familiar to them. I dug some more... FreeBSD 10.4 (End Of Life) behaves as Linux. But not FreeBSD 11.4 (oldest supported), nor OpenBSD 6.9 (latest release). So FreeBSD has had the bug and fixed it, a while back. OpenBSD does not have it now. Might be worth mentioning for anyone reporting this to some Linux bug report tool... Unfortunately this place in the code is not prepared to re-open a socket, so the rewrite is non-trivial, but I will have a look at it... |
I did some digging to see what the error path is in the kernel: The
The way that I read this is that it returns The stack trace is:
Perhaps of interest is how simple the I'm using Linux 5.4, so I was wondering if anything had changed in Linux 5.14. A I personally think that going down the Thank you again for the time you're spending in addressing this. I'll post if I find any thing more when I get a next window of time today to look at this. |
Just to elaborate: Why connect an UDP socket?The reason that
|
It seems Linux and the BSD:s are a in disagreement here. Use case:
On the BSD:s, if I check sockname after 3) it is RoutableIP:EphemeralPort, and after 4) it is 127.0.0.1:EphemeralPort (the same EphemeralPort as after 3)) On Linux (Ubuntu 18.04, Linux-5.4.0-80), sockname after 4) is the same as after 3); RoutableIP:EphemeralPort, so Linux does not change the socket's source address when the connect address is changed. This is a quite different behaviour from the BSD:s, and only kind of works when changing to the loopback address (still; firewall and name server source address rules may get in the way). This also explains why changing from the loopback address does not work since it would not be possible to get a reply routed back to a loopback source address. The only documented reasons for EINVAL from connect(2) for the BSD:s are; the length of the address does not match the address family, and, the destination address for a TCP connect is a UDP broadcast or multicast address. Quite obvious argument errors independent of the socket state, which is the norm for EINVAL. So I think Linux is out on deep water here. Connecting an UDP socket to different addresses is an old feature. Now I have to meditate over where (Erlang or C, |
Sorry, accidentally pressed the Close button |
Then again, the simplest solution for us would be if Linux could fix this misbehaviour... |
This fixes the problem for legacy diff --git a/erts/emulator/drivers/common/inet_drv.c b/erts/emulator/drivers/common/inet_drv.c
index 6c722dd87f..d635c31712 100644
--- a/erts/emulator/drivers/common/inet_drv.c
+++ b/erts/emulator/drivers/common/inet_drv.c
@@ -1,7 +1,7 @@
/*
* %CopyrightBegin%
*
- * Copyright Ericsson AB 1997-2020. All Rights Reserved.
+ * Copyright Ericsson AB 1997-2021. All Rights Reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -1660,7 +1660,9 @@ static void *realloc_wrapper(void *current, ErlDrvSizeT size){
#endif
# define ANC_BUFF_SIZE INET_DEF_BUFFER/2 /* XXX: not very good... */
+
#ifdef HAVE_UDP
+
static int load_address(ErlDrvTermData* spec, int i, char* buf)
{
int n;
@@ -1722,7 +1724,12 @@ static int load_address(ErlDrvTermData* spec, int i, char* buf)
}
return i;
}
-#endif
+
+#ifdef __linux__
+static struct sockaddr disconnect_sa;
+#endif /* #ifdef __linux__ */
+
+#endif /* #ifdef HAVE_UDP */
#ifdef HAVE_SCTP
@@ -12345,6 +12352,13 @@ static udp_descriptor* sctp_inet_copy(udp_descriptor* desc, SOCKET s, int* err)
#ifdef HAVE_UDP
static int packet_inet_init()
{
+#ifdef __linux__
+ sys_memzero((char *)&disconnect_sa, sizeof(disconnect_sa));
+#ifdef AF_UNSPEC
+ disconnect_sa.sa_family = AF_UNSPEC;
+#endif /* #ifdef AF_UNSPEC */
+#endif /* #ifdef __linux__ */
+
return 0;
}
@@ -12616,6 +12630,14 @@ static ErlDrvSSizeT packet_inet_ctl(ErlDrvData e, unsigned int cmd, char* buf,
(desc->sfamily, &desc->remote, &buf, &len)) != NULL)
return ctl_xerror(xerror, rbuf, rsize);
+#ifdef __linux__
+ /* Workaround for Linux's misbehaviour to not
+ * change the source address when connecting
+ * a datagram socket to a new destination
+ */
+ sock_connect(desc->s, &disconnect_sa, sizeof(disconnect_sa));
+#endif /* #ifdef __linux__ */
+
code = sock_connect(desc->s,
(struct sockaddr*) &desc->remote, len);
if (IS_SOCKET_ERROR(code)) { Now over to how to handle this in |
Plain text mode email preferred, and no top-posting. A self-contained reproducer in C is enough, plus a general description of the presumed bug. There is/was a bugzilla, but people tend to ignore it. Alternatively, report it to your distribution using their preferred methods. |
…59' into maint * raimo/work-around-linux-dgram-reconnect/GH-5092/OTP-17559: Test API errors Validate sockaddr tighter Work around Linux DGRAM reconnect misbehaviour Implement setting native socket addresses Work around Linux DGRAM reconnect misbehaviour Test udp reconnect
Fix merged. Thank you for investigating this issue! |
That's great!!! Thank you so much for fixing it! |
…59' into maint * raimo/work-around-linux-dgram-reconnect/GH-5092/OTP-17559: Use 'unspec' to dissolve association Implement AF_UNSPEC
Describe the bug
I'm using the OTP resolver and it is configured with two DNS servers. The first DNS server was properly queried, but when it returned an error (not an nxdomain one, but a refusal to answer the query). The second DNS server wasn't queried. In tracking this down, the udp_connect function was returning
{error, einval}
.To Reproduce
To consistently reproduce the bug, I made the first DNS server always return an rcode 4 (not implemented). Looking at
inet_res.erl
, it looks like quite a few errors could be used to trigger the code to try the next DNS server in the list.Next I ran
inet_res:resolve("google.com", in, a, [{verbose, true}])
and watched the messages show the error from the first server, the "Try UDP server" print for the next server and then the immediateUDP send failed: {error, einval}
print. Upon looking into this further, the{error, einval}
didn't come from send, but fromudp_connect
.I first thought that
gen_udp:connect/3
was failing because the socket was closed. There are a couple calls to close the socket ininet_res.erl
, but none of those were being called.Removing the call to
gen_udp:connect/3
fixes the issue. I.e., the resolver tries the first DNS server, the first server returns an error, then the resolver tries the second DNS server, gets back a good response and returns it successfully. Whatever was causinggen_udp:connect
to return an error doesn't impactgen_udp:send
from working. In fact, I was thinking that removing the call togen_udp:connect/3
would be a fix to this issue but since I hadn't traced this to what was causingconnect
to fail, I wanted to raise the issue first. It could also be an issue withgen_udp:connect/3
.Expected behavior
inet_res:resolve
tries the second DNS server in the list when the first one refuses to answer the query.Affected versions
OTP 24.0.5
Additional information
Here's an example log that shows a failed run of
:init_res.resolve/4
. This runs quickly. There are no pauses or timeouts.The text was updated successfully, but these errors were encountered: